
Best Python looping system for merging pandas DataFrame rows for export

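I am a self-taught data science student working through my first large Python portfolio project in several steps. The first step is to use pandas on IMDb's [Internet Movie Database] rather oddly structured .tsv files to build a fully searchable repository of all IMDb data. Neither the officially supported searches nor APIs such as OMDB (Open Movie Database) allow the kinds of detailed queries I need for the larger project.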

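IMDb's public files cover movies, TV shows, episodes, actors, directors, crew, in short the data for the whole industry, spread rather haphazardly across seven huge tsv files. I have confirmed that pandas can read all of this data and that my computer's memory can handle it. What I want to do is merge the seven tsv files into a single DataFrame object, which can then be exported to (preferably) a SQL database, or failing that to a huge spreadsheet or another, larger TSV file.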

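Every item in the database (a movie, an actor, a single TV episode) has a tconst value. In one file this column is called "titleId" and is a string. In every other file it is called "tconst", also a string. When I read that one file in, I will need to rename titleId to tconst. This is one of several challenges I have not gotten to yet.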
import pandas as pd

#set pandas formatting parameters
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',25)

#read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv",sep='\t')

#temporary hack - print the entire dataframe as test
print(showbiz_core)

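This works, but I am not sure how to proceed from here. I want to import each of the other tsv files and try to rebuild the IMDb database locally. That means I do not want duplicate tconst strings. Instead, new information about a tconst entry (such as a movie) should end up attached to it as new columns.

Should I be doing some kind of "for i in [new file]" loop? How would you approach this?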

Solution

The IMDb files are actually highly structured. Loops are always a poor construct for combining data.
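
To illustrate the point about loops, here is a minimal sketch that adds columns per tconst with merge() rather than iterating over rows. The three small DataFrames are hand-made stand-ins for title.basics, title.ratings and title.akas, not the real files, and the variable names are only illustrative. The values are copied from the sample output at the end of this answer.

```python
import pandas as pd

# Toy stand-ins for three IMDb tables (hand-made, not read from the real files)
basics = pd.DataFrame({
    "tconst": ["tt0468569", "tt0944947"],
    "primaryTitle": ["The Dark Knight", "Game of Thrones"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0468569", "tt0944947"],
    "averageRating": [9.0, 9.3],
})
# title.akas calls the key "titleId" instead of "tconst"
akas = pd.DataFrame({
    "titleId": ["tt0468569", "tt0944947"],
    "region": ["GB", "TR"],
})

# Each merge adds the new table's columns to the matching tconst row;
# no explicit loop over rows is needed.
combined = (
    basics
    .merge(ratings, on="tconst", how="left")
    .merge(akas.rename(columns={"titleId": "tconst"}), on="tconst", how="left")
)
print(combined)
```

Because each toy table holds one row per tconst, the result keeps one row per title. With the real files, several tables hold many rows per title and need to be reduced first.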

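  1. Structured data sourcing: I used wget rather than downloading the files manually.
  2. The files are large, so work with a subset for modelling. I simply used popular movies and actors as the driver.
  3. The comma-separated columns in the tsv files are actually sub-tables. Treat them as such. I built a reference entity, nmi, for this.
  4. There are other associative relationships in there as well: primaryProfession, genres.
  5. Finally, join (merge) everything from OMDB and IMDb together, taking the first row where many items are associated with a title.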
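
The idea of treating a comma-separated column as a sub-table can be sketched on the genres column in the same way the code further down handles knownForTitles. This is only an illustration: the two rows are hand-written sample values, and `title_genres` is a hypothetical name that does not appear in the code of this answer.

```python
import pandas as pd

# Hand-written sample rows; the genre strings are illustrative values
titles = pd.DataFrame({
    "tconst": ["tt0468569", "tt0944947"],
    "genres": ["Action,Crime,Drama", "Action,Adventure,Drama"],
})

# Split the comma-separated string into a list, then give every list
# element its own row: the result is an association table tconst <-> genre.
title_genres = (
    titles
    .assign(genre=lambda d: d["genres"].str.split(","))
    .explode("genre")
    .loc[:, ["tconst", "genre"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# The long form can be filtered and joined like any other table
drama = title_genres.loc[title_genres["genre"] == "Drama", "tconst"].tolist()
print(title_genres)
print(drama)
```

In the long form, an exact comparison replaces a substring search on the original string column, and the table can be merged on tconst like the others.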

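I have left the data as tsv for now; putting it into a database with the to_sql() method is clearly straightforward. The key points are sourcing and transformation, also known as ETL, which has become an unfashionable term. This could be supplemented further with web scraping. I looked at Box Office Mojo, but scraping it would require selenium because it is dynamic HTML.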
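
As a rough sketch of the to_sql() step mentioned above, the following writes a small DataFrame to an in-memory SQLite database and queries it back. The table name `title_ratings` and the use of SQLite are assumptions made for the illustration. The answer does not say which database to use.

```python
import sqlite3
import pandas as pd

# Two rows copied from the sample output at the end of this answer
ratings = pd.DataFrame({
    "tconst": ["tt0468569", "tt0944947"],
    "averageRating": [9.0, 9.3],
    "numVotes": [2237966, 1699318],
})

# ":memory:" keeps the example self-contained; a file path would persist the data
con = sqlite3.connect(":memory:")
ratings.to_sql("title_ratings", con, index=False, if_exists="replace")

# Once in the database, the data can be queried with ordinary SQL
top = pd.read_sql_query(
    "SELECT tconst FROM title_ratings WHERE averageRating > 9.1", con
)
con.close()
print(top)
```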

IMDb sourcing

import requests,json,re,urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget,gzip
from pathlib import Path
import numpy as np

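# find what IMDb has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(),"html.parser")
files = {}  # dataset name, e.g. "name.basics" -> local file name, e.g. "name.basics.tsv.gz"
for f in soup.find_all("a",href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        # strip both ".gz" and ".tsv" to get the dataset name used as the key
        files[Path(fn.stem).stem] = fn.name
        # download only the files that are not already in the working directory
        if not fn.is_file():
            wget.download(f["href"])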
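
Once the files are downloaded, pandas can also read a gzipped TSV directly and convert IMDb's `\N` markers to missing values at read time. The sketch below is an alternative to the gzip.open() plus replace() combination used in the next section, not the approach taken there. It writes a tiny stand-in file to a temporary directory so that it can run without the real downloads.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzipped TSV that mimics the IMDb layout (stand-in data only)
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "title.basics.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as fh:
    fh.write("tconst\tprimaryTitle\tendYear\n")
    fh.write("tt0468569\tThe Dark Knight\t\\N\n")   # "\N" marks a missing value
    fh.write("tt0944947\tGame of Thrones\t2019\n")

# Compression is inferred from the ".gz" suffix; na_values turns "\N" into NA,
# and the nullable "Int64" dtype keeps the year column integer despite the gap.
df = pd.read_csv(path, sep="\t", na_values="\\N", dtype={"endYear": "Int64"})
print(df)
```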


IMDb transform

Set alldata=True on the first run to prepare the data. On the second run set it to False, and you have a manageable subset.

alldata = False
subsetdata = True

dfs={}

# work with a subset of data to speed up modelling and iterations.  Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks","Will Smith","Clint Eastwood","Leonardo DiCaprio","Johnny Depp","Meryl Streep","Bruce Willis"]
tm = {'tconst': ['tt0111161','tt0468569','tt1375666','tt0137523','tt0110912','tt0109830','tt0944947','tt0133093','tt0120737','tt0167260','tt0068646'],'averageRating': [9.3,9.0,8.8,8.9,9.3,8.7,9.2],'numVotes': [2275837,2237966,1997918,1805137,1777920,1752954,1699318,1630083,1618100,1602417,1570167]}

# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]),sep="\t").replace({"\\N":np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                            # one regex alternation over the first ten title ids
                            # instead of ten chained str.contains() calls
                            | dfs[k]["knownForTitles"].str.contains("|".join(tm["tconst"][:10]),na=False)
                            )
                            & dfs[k]["knownForTitles"].str.contains("tt",na=False)]
        dfs[k].to_csv(f"{files[k]}_subset.tsv",sep="\t",index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv",sep="\t")
dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns},errors="ignore")
# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])

# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:,["nconst","knownForTitles"]]
 .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
 .explode("knownForTitles")
).rename(columns={"knownForTitles":"tconst"}).drop_duplicates()
# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()

for k in [k for k in files.keys() if k not in ["name.basics","omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]),sep="\t").replace({"\\N":np.nan})
        if k=="title.akas": dfs[k]=dfs[k].rename(columns={"titleId":"tconst"})
        # subset titles to those we have names
        if subsetdata:
            c = "tconst" if k!= "title.episode" else "parentTconst"
            try:
                # write tab-separated so the subset can be read back with sep="\t" below
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv",sep="\t",index=False))
            except KeyError as e:
                print(k,dfs[k].columns,e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv",sep="\t")
    dfs[k] = dfs[k].astype({c:"Int64" for c in dfs[k].columns},errors="ignore")

dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"],on="nconst").merge(dfs["title.basics"],on="tconst")

OMDB sourcing

omdbcols = ['Title','Year','Rated','Released','Runtime','Genre','Director','Writer','Actors','Plot','Language','Country','Awards','Poster','Ratings','Metascore','imdbRating','imdbVotes','imdbID','Type','DVD','BoxOffice','Production','Website','Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    # the cache file is written tab-separated at the end of this section
    dfs[omdbk] = pd.read_csv(files[omdbk],sep="\t",thousands=",")
    dfs[omdbk] = dfs[omdbk].astype({c:"Int64" for c in dfs[omdbk].columns},errors="ignore")
    

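k = "title.basics"
apikey = "xxx"  # placeholder, not part of the original snippet: replace with your own OMDB API key
# limited to 1000 API calls a day, so only fetch titles that have not been fetched already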
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie   Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params={"apikey":apikey,"i":tconst,"plot":"full"}
    res = requests.get("http://www.omdbapi.com/",params=params)
    if res.status_code!=200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk],pd.json_normalize(res.json())])
    
dfs[omdbk].to_csv(files[omdbk],index=False,sep="\t")
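
The two ideas in the loop above, flattening a JSON response with pd.json_normalize() and skipping titles that are already cached, can be tried offline. In this sketch `sample_response` is a hand-written dictionary that imitates part of an OMDB reply, with values taken from the output shown later in this answer. It is an assumption for illustration, not a captured API response, and no network call is made.

```python
import pandas as pd

# Hand-written stand-in for a parsed OMDB JSON reply (subset of fields)
sample_response = {
    "Title": "The Dark Knight",
    "Year": "2008",
    "imdbID": "tt0468569",
    "Type": "movie",
}

# Collect one single-row frame per response and concatenate once at the end
rows = [pd.json_normalize(sample_response)]
omdb = pd.concat(rows, ignore_index=True)

# Titles we would like OMDB data for
titles = pd.DataFrame({"tconst": ["tt0468569", "tt0944947"]})

# Only ids missing from the cache still need an API call
pending = titles.loc[~titles["tconst"].isin(omdb["imdbID"]), "tconst"].tolist()
print(pending)
```

Collecting the frames in a list and concatenating once avoids growing a DataFrame inside the loop, which matters more as the number of fetched titles increases.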

Sample analysis

# The Dark Knight   tt0468569
# Game of Thrones   tt0944947
# for demo purpose - just pick first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569","tt0944947"])
demo = (dfs[omdbk].loc[mask]
 .rename(columns={c:f"OMDB{c}" for c in dfs[omdbk].columns})
 .rename(columns={"OMDBimdbID":"tconst"})
 .merge(dfs["title.basics"],on="tconst")
 .merge(dfs["title.ratings"],on="tconst")
 # reduce each one-to-many table to its first row per title before merging
 .merge(dfs["title.akas"].groupby("tconst",as_index=False).first(),on="tconst")
 .merge(dfs["title.crew"].groupby("tconst",as_index=False).first(),on="tconst")
 .merge(dfs["title.principals"].groupby("tconst",as_index=False).first(),on="tconst")
 .merge(dfs["title.episode"].groupby("parentTconst",as_index=False).first(),
        left_on="tconst",right_on="parentTconst",how="left",suffixes=("","_ep"))
 .merge(dfs["nmi"]
        .merge(dfs["name.basics"],on="nconst")
        .groupby(["tconst"],as_index=False).first(),on="tconst",suffixes=("","_name"))

).T


Output

                                                                                                                                                                        0                                                                                                                                                                                              1
OMDBTitle                                                                                                                                                 The Dark Knight                                                                                                                                                                                Game of Thrones
OMDBYear                                                                                                                                                             2008                                                                                                                                                                                      2011–2019
OMDBRated                                                                                                                                                           PG-13                                                                                                                                                                                          TV-MA
OMDBReleased                                                                                                                                                  18 Jul 2008                                                                                                                                                                                    17 Apr 2011
OMDBRuntime                                                                                                                                                       152 min                                                                                                                                                                                         57 min
OMDBGenre                                                                                                                                  Action,Crime,Drama,Thriller                                                                                                                                                     Action,Adventure,Fantasy,Romance
OMDBDirector                                                                                                                                            Christopher Nolan                                                                                                                                                                                            NaN
OMDBWriter                          Jonathan Nolan (screenplay),Christopher Nolan (screenplay),Christopher Nolan (story),David S. Goyer (story),Bob Kane (characters)                                                                                                                                                                      David Benioff,D.B. Weiss
OMDBActors                                                                                                     Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine                                                                                                                                      Peter Dinklage,Lena Headey,Emilia Clarke,Kit Harington
OMDBLanguage                                                                                                                                            English,Mandarin                                                                                                                                                                                        English
OMDBCountry                                                                                                                                                       USA,UK                                                                                                                                                                                        USA,UK
OMDBAwards                                                                                                              Won 2 Oscars. Another 153 wins & 159 nominations.                                                                                                                                        Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster                                                             https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg                                                             https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc@._V1_SX300.jpg
OMDBRatings        [{'Source': 'Internet Movie Database','Value': '9.0/10'},{'Source': 'Rotten Tomatoes','Value': '94%'},{'Source': 'Metacritic','Value': '84/100'}]                                                                                                                                     [{'Source': 'Internet Movie Database','Value': '9.3/10'}]
OMDBMetascore                                                                                                                                                          84                                                                                                                                                                                           <NA>
OMDBimdbRating                                                                                                                                                          9                                                                                                                                                                                            9.3
OMDBimdbVotes                                                                                                                                                     2234169                                                                                                                                                                                        1679892
tconst                                                                                                                                                          tt0468569                                                                                                                                                                                      tt0944947
OMDBType                                                                                                                                                            movie                                                                                                                                                                                         series
OMDBDVD                                                                                                                                                       09 Dec 2008                                                                                                                                                                                            NaN
OMDBBoxOffice                                                                                                                                                $533,316,061                                                                                                                                                                                            NaN
OMDBProduction                                                                                                                            Warner Bros. Pictures/Legendary                                                                                                                                                                                            NaN
OMDBWebsite                                                                                                                                                          <NA>                                                                                                                                                                                           <NA>
OMDBResponse                                                                                                                                                            1                                                                                                                                                                                              1
OMDBtotalSeasons                                                                                                                                                     <NA>                                                                                                                                                                                              8
titleType                                                                                                                                                           movie                                                                                                                                                                                       tvSeries
primaryTitle                                                                                                                                              The Dark Knight                                                                                                                                                                                Game of Thrones
originalTitle                                                                                                                                             The Dark Knight                                                                                                                                                                                Game of Thrones
isAdult                                                                                                                                                                 0                                                                                                                                                                                              0
startYear                                                                                                                                                            2008                                                                                                                                                                                           2011
endYear                                                                                                                                                              <NA>                                                                                                                                                                                           2019
runtimeMinutes                                                                                                                                                        152                                                                                                                                                                                             57
genres                                                                                                                                                 Action,Drama                                                                                                                                                                         Action,Drama
averageRating                                                                                                                                                           9                                                                                                                                                                                            9.3
numVotes                                                                                                                                                          2237966                                                                                                                                                                                        1699318
ordering_x                                                                                                                                                             10                                                                                                                                                                                             10
title                                                                                                                                                     The Dark Knight                                                                                                                                                                                  Taht Oyunları
region                                                                                                                                                                 GB                                                                                                                                                                                             TR
language                                                                                                                                                               en                                                                                                                                                                                             tr
types                                                                                                                                                         imdbDisplay                                                                                                                                                                                    imdbDisplay
attributes                                                                                                                                             fake working title                                                                                                                                                                                  literal title
isOriginalTitle                                                                                                                                                         0                                                                                                                                                                                              0
directors                                                                                                                                                       nm0634240  nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers                                                                                                                           nm0634300,nm0634240,nm0333060,nm0004170                                                                                                      nm1125275,nm0552333,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y                                                                                                                                                             10                                                                                                                                                                                             10
nconst                                                                                                                                                          nm0746273                                                                                                                                                                                      nm0322513
category                                                                                                                                                         producer                                                                                                                                                                                          actor
job                                                                                                                                                              producer                                                                                                                                                                                        creator
characters                                                                                                                                                ["Bruce Wayne"]                                                                                                                                                                              ["Jorah Mormont"]
parentTconst                                                                                                                                                          NaN                                                                                                                                                                                      tt0944947
tconst_ep                                                                                                                                                             NaN                                                                                                                                                                                      tt1480055
seasonNumber                                                                                                                                                         <NA>                                                                                                                                                                                              1
episodeNumber                                                                                                                                                        <NA>                                                                                                                                                                                              1
nconst_name                                                                                                                                                     nm0000198                                                                                                                                                                                      nm0000293
primaryName                                                                                                                                                   Gary Oldman                                                                                                                                                                                      Sean Bean
birthYear                                                                                                                                                            1958                                                                                                                                                                                           1959
deathYear                                                                                                                                                            1998                                                                                                                                                                                           2020
primaryProfession                                                                                                                               actor,soundtrack,producer                                                                                                                                                            actor,producer,animation_department
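
The output above has one column per title because every one-to-many table was reduced with groupby().first() before being merged. A minimal sketch of the difference, using hand-made rows rather than the real title.akas file:

```python
import pandas as pd

titles = pd.DataFrame({
    "tconst": ["tt0468569"],
    "primaryTitle": ["The Dark Knight"],
})
# One title with two alternative-title rows (illustrative values)
akas = pd.DataFrame({
    "tconst": ["tt0468569", "tt0468569"],
    "region": ["GB", "TR"],
})

# A plain merge repeats the title once per matching akas row
many = titles.merge(akas, on="tconst")

# Keeping only the first akas row per tconst preserves one row per title
one = titles.merge(akas.groupby("tconst", as_index=False).first(), on="tconst")

print(len(many), len(one))
```

Taking the first row is a convenience for the demo and discards the other associations. For real analysis the long tables would normally be kept and joined only when needed.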
