Best Python looping system for merging pandas DataFrame rows for export
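I'm a self-taught data science student, currently working through my first large Python portfolio project in several steps. The first step is using pandas to work with IMDb's [Internet Movie Database] rather oddly structured .tsv files, in an effort to create a fully searchable big data repository of all IMDb data (the officially supported searches, and even APIs like OMDB (Open Movie Database), don't allow the kinds of detailed queries I need for the larger project).
IMDb's public files are structured so that they include every movie, TV show, episode, actor, director and crew member, everything in the entire business, scattered haphazardly across seven massive tsv files. I have confirmed that pandas can actually read in all of this data and that my computer's memory can handle it. What I want to do is merge the seven tsv files into a single DataFrame object, which can then be exported to (preferably) a SQL database, or even a massive spreadsheet or another TSV file, just bigger.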
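Each thing in the database (movie, actor, individual TV episode) has a tconst identifier, which is a string. In one file the column is labelled "titleId"; in every other file it is labelled "tconst". When I read that file in, I will need to rename titleId to tconst. This is one of several challenges I haven't gotten to yet.
import pandas as pd

# set pandas formatting parameters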
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',25)
#read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv",sep='\t')
#temporary hack - print the entire dataframe as test
print(showbiz_core)
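This works, but I'm not sure how to proceed next. I want to import each of the other tsv files to try to rebuild the IMDb database locally. This means I don't want duplicate tconst strings; instead, new information about a tconst entry (such as a movie) should be appended to it in the form of new columns.
Should I be doing some kind of "for i in [new file]" loop? How would you approach this?
Solution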
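The IMDb files are actually highly structured. Looping is always a poor construct for combining data.
- Structure the data sourcing. I used wget rather than sourcing the files manually.
- The files are large, so work with a subset for modelling. I just used popular movies and actors as the driver.
- Comma-separated columns in the tsv files are actually sub-tables. Treat them as such. I built a reference entity, nmi, for this.
- There are other associative relationships in there as well: primaryProfession, genres.
- Finally, join (merge) everything from OMDB and IMDb together, taking the first row where many items are associated with a title.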
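As a rough illustration of the "sub-table" idea in the list above, the sketch below splits a comma-separated column into a link table and joins through it with merge rather than a loop. The two small frames and their values are made up for illustration; only the column names (nconst, tconst, knownForTitles, primaryName, primaryTitle) follow the IMDb files.

```python
import pandas as pd

# Toy stand-ins for name.basics and title.basics (values are illustrative only)
names = pd.DataFrame({
    "nconst": ["nm1", "nm2"],
    "primaryName": ["Person A", "Person B"],
    "knownForTitles": ["tt1,tt2", "tt2"],
})
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "primaryTitle": ["Title One", "Title Two"],
})

# Turn the comma-separated column into a link table: one row per (nconst, tconst)
links = (
    names[["nconst", "knownForTitles"]]
    .assign(knownForTitles=lambda d: d["knownForTitles"].str.split(","))
    .explode("knownForTitles")
    .rename(columns={"knownForTitles": "tconst"})
    .drop_duplicates()
)

# Join through the link table instead of looping over rows
joined = (
    links
    .merge(names.drop(columns="knownForTitles"), on="nconst")
    .merge(titles, on="tconst")
)
print(joined)
```

Each merge adds columns keyed on the shared identifier, which is the effect the question was trying to achieve with a loop.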
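I have left the data as tsv for now; putting it into a database with the to_sql() method would be straightforward. The main point is the sourcing and transformation, also known as ETL, which has become an unfashionable term. The data could be supplemented further with web scraping. I looked at Box Office Mojo, but it would need selenium to scrape because it is dynamic HTML.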
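For the database step mentioned above, a minimal sketch could look like the following. It assumes an in-memory SQLite database and a hypothetical table name title_basics; the sample rows are invented, and a real run would write the merged frames instead.

```python
import sqlite3
import pandas as pd

# Illustrative rows only; in practice this would be one of the prepared frames
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "primaryTitle": ["Title One", "Title Two"],
    "startYear": [2008, 2011],
})

con = sqlite3.connect(":memory:")  # swap for a file path to persist the data
titles.to_sql("title_basics", con, if_exists="replace", index=False)

# Once loaded, the data can be queried with ordinary SQL
result = pd.read_sql_query(
    "SELECT primaryTitle FROM title_basics WHERE startYear > 2009", con
)
con.close()
print(result)
```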
IMDb sourcing
import requests,json,re,urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget,gzip
from pathlib import Path
import numpy as np
# find what IMdB has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(),"html.parser")
files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        # key is the file name without ".tsv.gz", e.g. "name.basics"
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
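As a side note on reading the downloaded files, pandas can usually open a gzip-compressed tsv directly and treat IMDb's \N marker as missing at read time. The sketch below writes a tiny stand-in file to a temporary directory so that it runs on its own; the file name and rows are assumptions for illustration, not part of the IMDb data.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Write a small gzip-compressed tsv that mimics the IMDb layout
tmp = Path(tempfile.mkdtemp()) / "sample.tsv.gz"
with gzip.open(tmp, "wt", encoding="utf-8") as fh:
    fh.write("tconst\tstartYear\tendYear\n")
    fh.write("tt1\t2008\t\\N\n")
    fh.write("tt2\t2011\t2019\n")

# Compression is inferred from the .gz suffix; \N becomes NaN while parsing
df = pd.read_csv(tmp, sep="\t", na_values="\\N")
print(df)
```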
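IMDb transformation
Set alldata=True on the first run to prepare the data. On the second run set it to False, and you have a manageable subset to work with.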
alldata = False
subsetdata = True
dfs={}
# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],'averageRating': [9.3,9.0,8.8,8.9,9.3,8.7,9.2],'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}
# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][9])
                             )
                            & dfs[k]["knownForTitles"].str.contains("tt")]
        dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])
# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:, ["nconst", "knownForTitles"]]
              .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
              .explode("knownForTitles")
              ).rename(columns={"knownForTitles": "tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()
for k in [k for k in files.keys() if k not in ["name.basics","omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
        if k == "title.akas": dfs[k] = dfs[k].rename(columns={"titleId": "tconst"})
        # subset titles to those we have names
        if subsetdata:
            c = "tconst" if k != "title.episode" else "parentTconst"
            try:
                # write tab-separated so the subset can be read back with sep="\t"
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
        dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")
dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"],on="nconst").merge(dfs["title.basics"],on="tconst")
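The ten chained str.contains calls in the subset filter could probably be collapsed into a single regular-expression alternation. The sketch below shows the idea on a small invented Series; the names known and wanted are placeholders, and na=False is an assumption about how missing knownForTitles values should be treated (as non-matches).

```python
import re
import pandas as pd

# Invented sample of a knownForTitles-style column, including a missing value
known = pd.Series(["tt0111161,tt0468569", "tt9999999", None, "tt0137523"])
wanted = ["tt0111161", "tt0137523"]

# Build "tt0111161|tt0137523"; re.escape guards against regex metacharacters
pattern = "|".join(re.escape(t) for t in wanted)

# na=False turns missing values into False instead of propagating NaN
mask = known.str.contains(pattern, na=False)
print(mask)
```

The resulting mask can be combined with the isin(l) condition using | in the same way as the original chain.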
OMDB sourcing
omdbcols = ['Title','Year','Rated','Released','Runtime','Genre','Director','Writer','Actors','Plot','Language','Country','Awards','Poster','Ratings','Metascore','imdbRating','imdbVotes','imdbID','Type','DVD','BoxOffice','Production','Website','Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    # the cache is written tab-separated further down, so read it back the same way
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
    dfs[omdbk] = dfs[omdbk].astype({c: "Int64" for c in dfs[omdbk].columns}, errors="ignore")
k = "title.basics"
# limited to 1000 API calls a day,so only fetch if have not done already
# apikey must hold your own OMDB API key; it is not defined in this snippet
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey": apikey, "i": tconst, "plot": "full"}
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code != 200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])

# cache what has been fetched so later runs only request missing titles
dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
Sample analysis
# The Dark Knight tt0468569
# Game of Thrones tt0944947
# for demo purpose - just pick first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
.rename(columns={c:f"OMDB{c}" for c in dfs[omdbk].columns})
.rename(columns={"OMDBimdbID":"tconst"})
.merge(dfs["title.basics"],on="tconst")
.merge(dfs["title.ratings"],on="tconst")
.merge(dfs["title.akas"].groupby("tconst",as_index=False).first(),on="tconst")
.merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
.merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
       left_on="tconst", right_on="parentTconst", how="left", suffixes=("", "_ep"))
.merge(dfs["nmi"]
       .merge(dfs["name.basics"], on="nconst")
       .groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("", "_name"))
).T
Output
column | 0 | 1
OMDBTitle | The Dark Knight | Game of Thrones
OMDBYear | 2008 | 2011–2019
OMDBRated | PG-13 | TV-MA
OMDBReleased | 18 Jul 2008 | 17 Apr 2011
OMDBRuntime | 152 min | 57 min
OMDBGenre | Action,Crime,Drama,Thriller | Action,Adventure,Fantasy,Romance
OMDBDirector | Christopher Nolan | NaN
OMDBWriter | Jonathan Nolan (screenplay),Christopher Nolan (screenplay),Christopher Nolan (story),David S. Goyer (story),Bob Kane (characters) | David Benioff,D.B. Weiss
OMDBActors | Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine | Peter Dinklage,Lena Headey,Emilia Clarke,Kit Harington
OMDBLanguage | English,Mandarin | English
OMDBCountry | USA,UK | USA,UK
OMDBAwards | Won 2 Oscars. Another 153 wins & 159 nominations. | Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster | https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg | https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc@._V1_SX300.jpg
OMDBRatings | [{'Source': 'Internet Movie Database','Value': '9.0/10'},{'Source': 'Rotten Tomatoes','Value': '94%'},{'Source': 'Metacritic','Value': '84/100'}] | [{'Source': 'Internet Movie Database','Value': '9.3/10'}]
OMDBMetascore | 84 | <NA>
OMDBimdbRating | 9 | 9.3
OMDBimdbVotes | 2234169 | 1679892
tconst | tt0468569 | tt0944947
OMDBType | movie | series
OMDBDVD | 09 Dec 2008 | NaN
OMDBBoxOffice | $533,316,061 | NaN
OMDBProduction | Warner Bros. Pictures/Legendary | NaN
OMDBWebsite | <NA> | <NA>
OMDBResponse | 1 | 1
OMDBtotalSeasons | <NA> | 8
titleType | movie | tvSeries
primaryTitle | The Dark Knight | Game of Thrones
originalTitle | The Dark Knight | Game of Thrones
isAdult | 0 | 0
startYear | 2008 | 2011
endYear | <NA> | 2019
runtimeMinutes | 152 | 57
genres | Action,Drama | Action,Drama
averageRating | 9 | 9.3
numVotes | 2237966 | 1699318
ordering_x | 10 | 10
title | The Dark Knight | Taht Oyunları
region | GB | TR
language | en | tr
types | imdbDisplay | imdbDisplay
attributes | fake working title | literal title
isOriginalTitle | 0 | 0
directors | nm0634240 | nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers | nm0634300,nm0634240,nm0333060,nm0004170 | nm1125275,nm0552333,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y | 10 | 10
nconst | nm0746273 | nm0322513
category | producer | actor
job | producer | creator
characters | ["Bruce Wayne"] | ["Jorah Mormont"]
parentTconst | NaN | tt0944947
tconst_ep | NaN | tt1480055
seasonNumber | <NA> | 1
episodeNumber | <NA> | 1
nconst_name | nm0000198 | nm0000293
primaryName | Gary Oldman | Sean Bean
birthYear | 1958 | 1959
deathYear | 1998 | 2020
primaryProfession | actor,soundtrack,producer | actor,producer,animation_department
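One thing worth keeping in mind when reading the output above: groupby(...).first() returns the first non-null value of each column separately, so a single output row can mix values from different source rows. That is one possible explanation for pairings such as the primaryName and deathYear rows, which need not come from the same person. The sketch below uses invented data to contrast this with drop_duplicates, which keeps the first row whole.

```python
import numpy as np
import pandas as pd

# Two people linked to the same title; only the second has a deathYear
people = pd.DataFrame({
    "tconst": ["tt1", "tt1"],
    "primaryName": ["Person A", "Person B"],
    "deathYear": [np.nan, 1998],
})

# first() picks the first non-null value per column: name from row 0, year from row 1
mixed = people.groupby("tconst", as_index=False).first()

# drop_duplicates keeps the entire first row, including its missing values
whole = people.drop_duplicates("tconst")

print(mixed)
print(whole)
```

For a demo that only needs one representative association per title, either approach works; drop_duplicates is the safer choice when the columns must stay consistent with each other.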