How do I create a Pandas DataFrame by reading a list of dictionaries from a txt file?
I downloaded Twitter data with tweepy and stored each tweet in tweet_data.
tweet_data = []
for tweet_id in tweet_id_list:
    try:
        tweet_line = api.get_status(tweet_id, trim_user=True, include_my_retweet=False, include_entities=False, include_ext_alt_text=False, tweet_mode='extended')
        tweet_data.append(tweet_line)
    except Exception:
        continue  # if tweet_id is not found on Twitter, move on to the next tweet_id
Then I wrote tweet_data to 'twitter_json.txt'.
with open('twitter_json.txt', 'w') as txt:
    for data in tweet_data:
        tweet = data._json
        tweet = json.dumps(tweet)
        try:
            txt.write(tweet + '\n')
        except Exception as e:
            print(e)
Here is part of the data in the text file.
{"created_at": "Tue Aug 01 16:23:56 +0000 2017","id": sample_01,"id_str": sample_01,"full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 ","truncated": false,"display_text_range": [0,85],"extended_entities": {"media": [{"id": 892420639486877696,"id_str": "892420639486877696","indices": [86,109],"media_url": "some_url","media_url_": "some_url","url": some_url,"display_url": some_url,"expanded_url": some_url,"type": "photo","sizes": {"thumb": {"w": 150,"h": 150,"resize": "crop"},"medium": {"w": 540,"h": 528,"resize": "fit"},"small": {"w": 540,"large": {"w": 540,"resize": "fit"}}}]},"source": "<a some_url","in_reply_to_status_id": null,"in_reply_to_status_id_str": null,"in_reply_to_user_id": null,"in_reply_to_user_id_str": null,"in_reply_to_screen_name": null,"user": {"id": 4196983835,"id_str": "4196983835"},"geo": null,"coordinates": null,"place": null,"contributors": null,"is_quote_status": false,"retweet_count": 7427,"favorite_count": 35179,"favorited": false,"retweeted": false,"possibly_sensitive": false,"possibly_sensitive_appealable": false,"lang": "en"}
{"created_at": "Tue Aug 01 00:17:27 +0000 2017","id": sample_02,"id_str": sample_02,"full_text": "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not,she's available for pats,snugs,boops,the whole bit. 13/10 some_url",138],"extended_entities": {"media": [{"id": 892177413194625024,"id_str": "892177413194625024","indices": [139,162],"url": "some_url","display_url": "some_url","expanded_url": "some_url","medium": {"w": 1055,"h": 1200,"small": {"w": 598,"h": 680,"large": {"w": 1407,"h": 1600,"source": "some_url","retweet_count": 5524,"favorite_count": 30458,"lang": "en"}
Next, I read the 'twitter_json.txt' file, because I want to create a DataFrame from it with Pandas.
with open('twitter_json.txt') as txt:
    data = [line.strip() for line in txt]
Here is a snapshot of the resulting DataFrame, which does not look right.
print(pd.DataFrame(data))
0
0 {"created_at": "Tue Aug 01 16:23:56 +0000 2017...
1 {"created_at": "Tue Aug 01 00:17:27 +0000 2017...
I want the DataFrame to have columns such as "created_at", "id", "id_str", and so on. How can I do that?
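Solution
It works if you modify your workflow slightly. In the question, each line of the file is kept as a raw string, so pd.DataFrame receives a list of strings and produces a single column. I used a different write/read routine: the whole list is written as one JSON array and parsed back into a list of dictionaries with json.loads. Note that I used my own data, so the output below will not match yours.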
import json
import pandas as pd

# build a list of JSON-compatible dicts first
write_data = [tweet._json for tweet in tweet_data]

# write the whole list to the file as a single JSON array
with open('twitter_json.txt', 'w') as f:
    f.write(json.dumps(write_data))

# read it back and parse it with json.loads
with open('twitter_json.txt', 'rb') as f:
    data = json.loads(f.read().decode('utf-8'))

pd.DataFrame(data)
Output
created_at id id_str full_text truncated display_text_range ... retweet_count favorite_count favorited retweeted possibly_sensitive lang
0 Fri Jan 08 11:16:09 +0000 2021 1347502345517735940 1347502345517735940 ? La suite du voyage du futur métro d’Hanoï,f... False [0,211] ... 1 3 False False False fr
1 Fri Jan 08 11:15:31 +0000 2021 1347502185920286722 1347502185920286722 ? The continuation of the journey of the futur... False [0,211] ... 1 5 False False False en
[2 rows x 26 columns]
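If the file has already been written with one JSON object per line, as in the question, it may not be necessary to rewrite it. The sketch below shows two ways that should work for such a file: parsing each line with json.loads before building the DataFrame, or letting pd.read_json read it with lines=True. The sample records and the temporary file path are made up for illustration and merely stand in for real tweet._json dictionaries.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records standing in for tweet._json dictionaries.
records = [
    {"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id_str": "1",
     "retweet_count": 7427, "lang": "en"},
    {"created_at": "Tue Aug 01 00:17:27 +0000 2017", "id_str": "2",
     "retweet_count": 5524, "lang": "en"},
]

# Write one JSON object per line, the same layout the question's code produces.
# The temporary directory is only used to keep the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "twitter_json.txt")
with open(path, "w", encoding="utf-8") as txt:
    for record in records:
        txt.write(json.dumps(record) + "\n")

# Option 1: parse every non-empty line into a dict, then build the DataFrame.
with open(path, encoding="utf-8") as txt:
    data = [json.loads(line) for line in txt if line.strip()]
df_manual = pd.DataFrame(data)

# Option 2: let pandas read the line-delimited JSON file directly.
df_lines = pd.read_json(path, lines=True)

print(df_manual.columns.tolist())
print(df_lines.columns.tolist())
```

Both DataFrames end up with one column per top-level key. pd.read_json applies its own type inference, so column dtypes may differ from the json.loads version; nested values such as "user" remain dictionaries inside a single column in either case.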