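How to preprocess JSON data with Python in a Jupyter notebook
I am trying to apply some preprocessing steps to a JSON dataset. This is easy with .csv files, but I can't figure out how to use operations such as isnull(), fillna(), dropna(), and the imputer class.
Below are the commands I have run so far. I could not get to the operations above because I can't work out how to handle a dataset stored as a JSON file.
Dataset link: https://drive.google.com/file/d/1puNNrRaV-Jt_kt709fuYGCvDW9-EuwoB/view?usp=sharing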
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
dataset = pd.read_json('moviereviews.json',orient='columns')
print(dataset)
movies = pd.read_json( ( dataset).to_json(),orient='index')
print(movies)
print(type(movies))
movie = pd.read_json( ( dataset['12 Strong']).to_json(),orient='index')
print(movie)
movie_name = [
"12 Strong","A Ciambra","All The Money In The World","Along With The Gods: The Two Worlds","Bilal: A New Breed Of Hero","Call Me By Your Name","Condorito: La Película","Darkest Hour","Den Of Thieves","Downsizing","Father figures","Film Stars Don'T Die In Liverpool","Forever My Girl","Happy End","Hostiles","I,Tonya","In The Fade (Aus Dem Nichts)","Insidious: The Last Key","Jumanji: Welcome To The Jungle","Mary And The Witch'S Flower","Maze Runner: The Death Cure","Molly'S Game","Paddington 2","Padmaavat","Phantom Thread","Pitch Perfect 3","Proud Mary","Star Wars: Episode Viii - The Last Jedi","Star Wars: The Last Jedi","The Cage fighter","The Commuter","The Final Year","The Greatest Showman","The Insult (L'Insulte)","The Post","The Shape Of Water","Una Mujer Fantástica","Winchester"
]
print(movie_name)
data = []
for moviename in movie_name:
    movie = pd.read_json( ( dataset[moviename]).to_json(),orient='index')
    data.append(movie)
print(data)
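Solution
One of the challenges with this dataset is that it uses different key names for the same data, for example 'Tomato Score' and 'tomatoscore'. The solution below is not the best one and could be optimized considerably, but I wrote it this way so that the steps taken to make the data consistent are easier to see: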
import json
import pandas as pd

# Load the raw JSON: a list of dicts that map movie name -> fields
with open('moviereviews.json', "r") as read_file:
    dataset = json.load(read_file)

data = []
for index in range(len(dataset)):
    for key in dataset[index]:
        movie_name = key
        if 'Genre' in dataset[index][key]:
            genre = dataset[index][key]['Genre']
        else:
            genre = None
        if 'Gross' in dataset[index][key]:
            gross = dataset[index][key]['Gross']
        else:
            gross = None
        if 'IMDB Metascore' in dataset[index][key]:
            imdb = dataset[index][key]['IMDB Metascore']
        else:
            imdb = None
        # The next three fields appear under two different spellings
        if 'Popcorn Score' in dataset[index][key]:
            popcorn = dataset[index][key]['Popcorn Score']
        elif 'popcornscore' in dataset[index][key]:
            popcorn = dataset[index][key]['popcornscore']
        else:
            popcorn = None
        if 'Rating' in dataset[index][key]:
            rating = dataset[index][key]['Rating']
        elif 'rating' in dataset[index][key]:
            rating = dataset[index][key]['rating']
        else:
            rating = None
        if 'Tomato Score' in dataset[index][key]:
            tomato = dataset[index][key]['Tomato Score']
        elif 'tomatoscore' in dataset[index][key]:
            tomato = dataset[index][key]['tomatoscore']
        else:
            tomato = None
        # One row per movie, with consistent column names
        data.append({'Movie Name': movie_name,'Genre': genre,'Gross': gross,'IMDB Metascore': imdb,'Popcorn Score': popcorn,'Rating': rating,'Tomato Score': tomato})

df = pd.DataFrame(data)
df
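The key normalization above can also be written more compactly with a lookup table of spellings. The following is only a sketch: the ALIASES table and the normalize helper are hypothetical names introduced here, and the inline sample stands in for the result of json.load on moviereviews.json, with made-up values shaped like the real records.

```python
import pandas as pd

# Hypothetical alias table: canonical column name -> key spellings seen in the file.
ALIASES = {
    "Genre": ["Genre"],
    "Gross": ["Gross"],
    "IMDB Metascore": ["IMDB Metascore"],
    "Popcorn Score": ["Popcorn Score", "popcornscore"],
    "Rating": ["Rating", "rating"],
    "Tomato Score": ["Tomato Score", "tomatoscore"],
}

def normalize(movie_name, fields):
    """Build one row with canonical keys; fields that are absent become None."""
    row = {"Movie Name": movie_name}
    for canonical, spellings in ALIASES.items():
        # Take the first spelling that is present, otherwise None.
        row[canonical] = next((fields[k] for k in spellings if k in fields), None)
    return row

# Assumed stand-in for json.load(...): a list of dicts mapping movie name -> fields.
sample = [{
    "12 Strong": {"Genre": "Action", "Gross": "$1,465,000", "IMDB Metascore": "54",
                  "Popcorn Score": 72, "Rating": "R", "Tomato Score": 54},
    "All The Money In The World": {"popcornscore": 72, "rating": "R", "tomatoscore": 76},
}]

rows = [normalize(name, fields) for block in sample for name, fields in block.items()]
df_normalized = pd.DataFrame(rows)
print(df_normalized)
```

Adding another spelling then only means extending the list in ALIASES, rather than adding another elif branch.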
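Alternatively, you can split the dictionary into its values and its keys, read them separately, and fill every NaN with the string 'None' in one step.
If your parsed JSON is called data, then: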
df = pd.DataFrame(data[0].values()).fillna('None')
df['Movie Name'] = pd.DataFrame(data[0].keys())
df.set_index('Movie Name',inplace=True)
df.head()
                                     Genre      Gross       IMDB Metascore  Popcorn Score  Rating   Tomato Score  popcornscore  rating  tomatoscore
Movie Name
12 Strong                            Action     $1,465,000  54              72             R        54            None          None    None
A Ciambra                            Drama      unknown     70              unknown        unrated  unkown        None          None    None
All The Money In The World           None       None        None            None           None     None          72.0          R       76.0
Along With The Gods: The Two Worlds  None       None        None            None           None     None          90.0          NR      50.0
Bilal: A New Breed Of Hero           Animation  unknown     52              unknown        unrated  unkown        None          None    None
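The output above still has duplicate columns ('Tomato Score' and 'tomatoscore', and so on) and uses the strings 'unknown' and 'unkown' as placeholders. Once those are cleaned up, isnull(), fillna() and dropna() from the question work as they do with a .csv file. Here is a minimal sketch of that cleanup, assuming a small hand-built frame (df_demo, a hypothetical name) that mirrors a few of the rows shown above rather than the full dataset. The mean fill is only a simple stand-in for an imputer.

```python
import pandas as pd

# Hypothetical rows mirroring the printed output above (not the full dataset).
df_demo = pd.DataFrame(
    {
        "Genre": ["Action", "Drama", None],
        "Gross": ["$1,465,000", "unknown", None],
        "Tomato Score": [54, "unkown", None],
        "tomatoscore": [None, None, 76.0],
    },
    index=pd.Index(
        ["12 Strong", "A Ciambra", "All The Money In The World"], name="Movie Name"
    ),
)

# Treat the placeholder strings (including the misspelled 'unkown') as missing.
placeholders = ["unknown", "unkown"]
df_demo = df_demo.mask(df_demo.isin(placeholders))

# Merge the alternate spelling into the canonical column, then drop the duplicate.
df_demo["Tomato Score"] = df_demo["Tomato Score"].combine_first(df_demo["tomatoscore"])
df_demo = df_demo.drop(columns="tomatoscore")

# Convert text columns to numbers so that statistics can be computed.
df_demo["Tomato Score"] = pd.to_numeric(df_demo["Tomato Score"], errors="coerce")
df_demo["Gross"] = pd.to_numeric(
    df_demo["Gross"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# The usual missing-value tools now apply.
missing_counts = df_demo.isnull().sum()   # missing values per column
filled = df_demo.fillna(
    {"Genre": "Unknown", "Tomato Score": df_demo["Tomato Score"].mean()}
)
complete = df_demo.dropna()               # keep only rows with no missing values

print(missing_counts)
print(filled)
print(complete)
```

Whether to fill or drop depends on the column: a mean fill can be reasonable for a score, while a missing 'Gross' is probably better left as NaN than guessed.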