给定此数据框:
生产:
即将多索引的分组级别填充到标准长度(行数)
在相当大的多索引数据帧(〜几千列和几百万行)上,是否有一种合理的快速方法?
这是给定的数据框字典,以供快速参考:
d = {'region': {0: 'intro',1: 'intro',2: 'intro',3: 'mid',4: 'mid',5: 'start',6: 'start',7: 'start',8: 'title',9: 'title'},'feat_index': {0: 9,1: 3,2: 0,3: 7,4: 8,5: 2,6: 4,7: 1,8: 6,9: 5},'position_in_region': {0: 422,1: 5834,2: 8813,3: 3187,4: 9407,5: 997,6: 3154,7: 8416,8: 5408,9: 8421},'document_0': {0: 0.39,1: 0.79,2: 0.01,3: 0.55,4: 0.99,5: 0.67,6: 0.61,7: 0.84,8: 0.15,9: 0.23},'document_1': {0: 0.8,1: 0.06,2: 0.92,3: 0.74,4: 0.06,5: 0.96,6: 0.57,7: 0.19,8: 0.29,9: 0.24},'document_2': {0: 0.81,1: 0.15,2: 0.19,3: 0.17,4: 0.11,5: 0.34,6: 0.8,7: 0.03,8: 0.67,9: 0.46}}
df = pd.DataFrame(d).set_index(['region','feat_index','position_in_region'])
最佳答案
您可以使用由
numpy.repeat
和numpy.tile
创建的帮助器DataFrame与左联接合并:
#get number of new rows by Counter.most_common(1)
from collections import Counter
no_vals = Counter(df.index.labels[0]).most_common(1)[0][1]
print(no_vals)
3
df1 = pd.DataFrame({'region':np.repeat(df.index.levels[0],no_vals),'id': np.tile(np.arange(no_vals),len(np.unique(df.index.labels[0])))})
print (df1)
region id
0 intro 0
1 intro 1
2 intro 2
3 mid 0
4 mid 1
5 mid 2
6 start 0
7 start 1
8 start 2
9 title 0
10 title 1
11 title 2
#MultiIndex to columns
df = df.reset_index()
#new Could with counter of regions
df.insert(1,'id',df.groupby('region').cumcount())
#merge,remove helper id columns and create MultiIndex
df = (df1.merge(df,how='left')
.drop('id',1)
.set_index(['region','position_in_region']))
print (df)
document_0 document_1 document_2
region feat_index position_in_region
intro 9.0 422.0 0.39 0.80 0.81
3.0 5834.0 0.79 0.06 0.15
0.0 8813.0 0.01 0.92 0.19
mid 7.0 3187.0 0.55 0.74 0.17
8.0 9407.0 0.99 0.06 0.11
NaN NaN NaN NaN NaN
start 2.0 997.0 0.67 0.96 0.34
4.0 3154.0 0.61 0.57 0.80
1.0 8416.0 0.84 0.19 0.03
title 6.0 5408.0 0.15 0.29 0.67
5.0 8421.0 0.23 0.24 0.46
NaN NaN NaN NaN NaN
DataFrame.reindex
和MultiIndex.from_product
的另一种解决方案:
from collections import Counter
no_vals = Counter(df.index.labels[0]).most_common(1)[0][1]
print(no_vals)
3
mux = pd.MultiIndex.from_product([df.index.levels[0],np.arange(no_vals)],names=['region','id'])
print (mux)
MultiIndex(levels=[['intro','mid','start','title'],[0,1,2]],codes=[[0,2,3,3],'id'])
df = df.reset_index(level=[1,2]).set_index(df.groupby(level=0).cumcount(),append=True)
df = (df.reindex(mux).reset_index(level=1,drop=True)
.set_index(['feat_index','position_in_region'],append=True))
print (df)
document_0 document_1 document_2
region feat_index position_in_region
intro 9.0 422.0 0.39 0.80 0.81
3.0 5834.0 0.79 0.06 0.15
0.0 8813.0 0.01 0.92 0.19
mid 7.0 3187.0 0.55 0.74 0.17
8.0 9407.0 0.99 0.06 0.11
NaN NaN NaN NaN NaN
start 2.0 997.0 0.67 0.96 0.34
4.0 3154.0 0.61 0.57 0.80
1.0 8416.0 0.84 0.19 0.03
title 6.0 5408.0 0.15 0.29 0.67
5.0 8421.0 0.23 0.24 0.46
NaN NaN NaN NaN NaN
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。