How to create a dictionary of dataframes from chunks

I have a dataframe df:
permno date time_avail_m ... OperProfRD_q _merge ret
100000 11167 1989-01-31 1989m1 ... NaN both -0.170732
100001 11167 1989-02-28 1989m2 ... NaN both -0.088235
100002 11167 1989-03-31 1989m3 ... NaN both -0.064516
100003 11167 1989-05-31 1989m5 ... NaN both 0.181818
100004 11167 1989-06-30 1989m6 ... NaN both 0.179487
df.info() returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 10000 to 19999
Columns: 320 entries, permno to ret
dtypes: datetime64[ns](1), float64(304), int64(13), object(2)
memory usage: 24.4+ MB
None
The output above was obtained by running df.head() and df.info() on a chunk of my dataframe df.

I need to create a dictionary of dataframes, where each dictionary key is a value from the date column and each dictionary value is a dataframe indexed by permno, with the remaining columns of df as its columns. Is there an efficient way to do this? I would like to do it in chunks, because df is quite large.
Solution

Here is an example of how to implement a groupby operation on out-of-memory data that is read in chunks.

Sample data
import pandas as pd
file = 'C:/users/ricar/downloads/mushrooms.csv' # downloaded from kaggle
# df = pd.read_csv(file, nrows=2)
# df.info()
# Data columns (total 23 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 class 2 non-null object
# 1 cap-shape 2 non-null object
# 2 cap-surface 2 non-null object
# 3 cap-color 2 non-null object
# 4 bruises 2 non-null object
# 5 odor 2 non-null object
# 6 gill-attachment 2 non-null object
# 7 gill-spacing 2 non-null object
# 8 gill-size 2 non-null object
# 9 gill-color 2 non-null object
# 10 stalk-shape 2 non-null object
# 11 stalk-root 2 non-null object
# 12 stalk-surface-above-ring 2 non-null object
# 13 stalk-surface-below-ring 2 non-null object
# 14 stalk-color-above-ring 2 non-null object
# 15 stalk-color-below-ring 2 non-null object
# 16 veil-type 2 non-null object
# 17 veil-color 2 non-null object
# 18 ring-number 2 non-null object
# 19 ring-type 2 non-null object
# 20 spore-print-color 2 non-null object
# 21 population 2 non-null object
# 22 habitat 2 non-null object
# dtypes: object(23)
# memory usage: 496.0+ bytes
Build the grouper
from collections import defaultdict
# pick your pivot columns
idx = 'cap-shape'
grouper = ['cap-surface']
# populate the grouper
groups = defaultdict(list)
for chunk in pd.read_csv(file, usecols=grouper, chunksize=1000):
    chunk = chunk.reset_index().set_index(grouper).squeeze()
    for key, g in chunk.groupby(chunk.index):
        groups[key].extend(g.to_list())
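The grouper-building step can be exercised on a tiny synthetic input. The in-memory csv_text below is a hypothetical stand-in for the real file, reduced to the grouper column, and chunksize=2 forces several chunks to mimic an out-of-memory read:

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for the on-disk CSV, reduced to the grouper column.
csv_text = "cap-surface\ns\ny\ns\nf\ny\ns\n"

grouper = ['cap-surface']
groups = defaultdict(list)

# chunksize=2 forces several chunks, mimicking an out-of-memory read.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=grouper, chunksize=2):
    # reset_index() exposes the global row position; set_index + squeeze
    # turns the chunk into a Series of positions indexed by the grouper value.
    chunk = chunk.reset_index().set_index(grouper).squeeze()
    for key, g in chunk.groupby(chunk.index):
        groups[key].extend(g.to_list())

print(dict(groups))  # each grouper value maps to its global row positions
```

The result maps each distinct grouper value to the row positions where it occurs, which is exactly what the filtering step below needs.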
Use the grouper to filter the data as it is loaded in chunks
# load a single sub-dataframe
def load_subdf(key, **kwargs):
    out = []
    for chunk in pd.read_csv(file, **kwargs):
        out.append(chunk[chunk[grouper[0]].eq(key)])
    return pd.concat(out).drop(columns=grouper)

df_f = load_subdf('f', index_col=idx, chunksize=1000)
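As a self-contained check, the same filtering function can be run against a hypothetical miniature CSV written to a temporary file (the column names mirror the mushroom data, but the values are made up):

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature CSV standing in for the real file.
csv_text = (
    "cap-shape,cap-surface,class\n"
    "x,s,p\n"
    "f,y,e\n"
    "x,f,e\n"
    "f,s,p\n"
)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(csv_text)

idx = 'cap-shape'
grouper = ['cap-surface']

def load_subdf(key, **kwargs):
    """Concatenate only the rows whose grouper value equals `key`."""
    out = []
    for chunk in pd.read_csv(tmp.name, **kwargs):
        out.append(chunk[chunk[grouper[0]].eq(key)])
    return pd.concat(out).drop(columns=grouper)

df_s = load_subdf('s', index_col=idx, chunksize=2)
print(df_s)  # indexed by cap-shape, with the grouper column dropped
os.unlink(tmp.name)
```

Only the rows where cap-surface equals 's' survive, the result is indexed by cap-shape, and the grouper column itself is dropped, matching the shape of the output described below.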
Output
df_f.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2320 entries, x to k
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 2320 non-null object
1 cap-color 2320 non-null object
2 bruises 2320 non-null object
3 odor 2320 non-null object
4 gill-attachment 2320 non-null object
5 gill-spacing 2320 non-null object
6 gill-size 2320 non-null object
7 gill-color 2320 non-null object
8 stalk-shape 2320 non-null object
9 stalk-root 2320 non-null object
10 stalk-surface-above-ring 2320 non-null object
11 stalk-surface-below-ring 2320 non-null object
12 stalk-color-above-ring 2320 non-null object
13 stalk-color-below-ring 2320 non-null object
14 veil-type 2320 non-null object
15 veil-color 2320 non-null object
16 ring-number 2320 non-null object
17 ring-type 2320 non-null object
18 spore-print-color 2320 non-null object
19 population 2320 non-null object
20 habitat 2320 non-null object
dtypes: object(21)
memory usage: 398.8+ KB
Note that the index is no longer the default RangeIndex, and the grouper column is no longer part of the result.
First answer:

Your dataframe is small enough to reshape in memory... try the following:
df = df.set_index('permno') # discard current index
dict_dfs = {date: gdf for date, gdf in df.groupby('date')}
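On a hypothetical miniature version of the questioner's data (the permno, date, and ret values below are made up), the two lines above behave as follows:

```python
import pandas as pd

# Hypothetical miniature version of the questioner's dataframe.
df = pd.DataFrame({
    'permno': [11167, 11167, 22222, 22222],
    'date': ['1989-01-31', '1989-02-28', '1989-01-31', '1989-02-28'],
    'ret': [-0.17, -0.08, 0.05, 0.10],
})

df = df.set_index('permno')  # discard the current index
# One sub-dataframe per unique date, keyed by that date's value.
dict_dfs = {date: gdf for date, gdf in df.groupby('date')}

print(sorted(dict_dfs))        # the dictionary keys are the date values
print(dict_dfs['1989-01-31'])  # a permno-indexed sub-dataframe
```

Each value in dict_dfs is a dataframe indexed by permno whose rows share one date, which is exactly the structure the question asks for.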