Groupby with different aggregations on different columns, combined with a cumulative sum from another dataset

I have a dataframe sorted by date and time. df1:

ID    Date     A_sum  A_count   B_sum   B_count  A_last  B_last  
abc   01/jan    26       2        25       2       0      0
xyz   01/jan    54       3        45       3       4      6

df2:

ID     Date     Time      A         B
abc   02/jan     11       10        10 
abc   02/jan     12       14        13
xyz   02/jan      1       26        24
xyz   02/jan      2       18        15
xyz   02/jan      3       20        16

I want to append these two dataframes on ID and get an updated summary of df2 as the output:

ID    Date     A_sum             A_count    B_sum   B_count  A_last  B_last  
abc   02/jan  50 #26+10+14        4 #2+2     48       4      14      13
xyz   02/jan  118 #54+26+18+20    6 #3+3    100       6      20      16

So it takes the previous values of the columns from df1 and adds the df2 values to them.

Solution

You can use .groupby() with named aggregation to convert df2 to the same layout as df1, then append the result to df1 and run another round of .groupby() and aggregation, as follows:

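# Step 1: aggregate df2 to one row per ID, using the same column names as df1
df3 = (df2.groupby(['ID','Date'],as_index=False,sort=False)
          .agg(A_sum=('A','sum'),A_count=('A','count'),
               B_sum=('B','sum'),B_count=('B','count'),
               A_last=('A','last'),B_last=('B','last'))
      )

# Step 2: stack df1 and df3, then aggregate per ID
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_out = (pd.concat([df1,df3])
             .groupby('ID',as_index=False)
             .agg({'Date': 'last','A_sum': 'sum','A_count': 'sum','B_sum': 'sum','B_count': 'sum','A_last': 'last','B_last': 'last'})
         )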
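As a self-contained check of the two steps described above, here is a minimal sketch that rebuilds df1 and df2 from the question's sample tables and runs the same aggregation. It assumes pandas 2.x semantics (pd.concat instead of DataFrame.append); the sample frames are typed in by hand rather than taken from any real file.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": ["abc", "xyz"],
    "Date": ["01/jan", "01/jan"],
    "A_sum": [26, 54], "A_count": [2, 3],
    "B_sum": [25, 45], "B_count": [2, 3],
    "A_last": [0, 4], "B_last": [0, 6],
})
df2 = pd.DataFrame({
    "ID": ["abc", "abc", "xyz", "xyz", "xyz"],
    "Date": ["02/jan"] * 5,
    "Time": [11, 12, 1, 2, 3],
    "A": [10, 14, 26, 18, 20],
    "B": [10, 13, 24, 15, 16],
})

# Step 1: collapse df2 to one row per ID, using df1's column names.
df3 = df2.groupby(["ID", "Date"], as_index=False, sort=False).agg(
    A_sum=("A", "sum"), A_count=("A", "count"),
    B_sum=("B", "sum"), B_count=("B", "count"),
    A_last=("A", "last"), B_last=("B", "last"),
)

# Step 2: stack the old and new summaries, then combine them per ID.
# Sums and counts are added; Date and the *_last columns take the newest row.
df_out = (
    pd.concat([df1, df3], ignore_index=True)
    .groupby("ID", as_index=False)
    .agg({"Date": "last", "A_sum": "sum", "A_count": "sum",
          "B_sum": "sum", "B_count": "sum",
          "A_last": "last", "B_last": "last"})
)
print(df_out)
```

The 'last' aggregations rely on df1 being placed before df3 in the concat, so the row order decides which value counts as the newest.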

A slightly longer way

>>> import pandas as pd
>>> from io import StringIO
>>>
>>> df1 = pd.read_csv(StringIO("""ID    Date     A_sum  A_count   B_sum   B_count  A_last  B_last
... abc   01/jan    26       2        25       2       0      0
... xyz   01/jan    54       3        45       3       4      6"""),sep=r"\s+")
>>>
>>>
>>> df2 = pd.read_csv(StringIO("""ID     Date     Time      A         B
... abc   02/jan     11       10        10
... abc   02/jan     12       14        13
... xyz   02/jan      1       26        24
... xyz   02/jan      2       18        15
... xyz   02/jan      3       20        16"""),sep=r"\s+")
>>>
>>>
>>>
>>> df2["A_sum"]   = df2.groupby("ID")["A"].transform("sum")
>>> df2["A_count"] = df2.groupby("ID")["A"].transform("count")
>>> df2["A_last"]  = df2.groupby("ID")["A"].transform("last")
>>>
>>> df2["B_sum"]   = df2.groupby("ID")["B"].transform("sum")
>>> df2["B_count"] = df2.groupby("ID")["B"].transform("count")
>>> df2["B_last"]  = df2.groupby("ID")["B"].transform("last")
>>>
>>> del df2["Time"]
>>> del df2["A"]
>>> del df2["B"]
>>>
>>> df2 = df2.groupby("ID", as_index=False).last()  # keep one row per ID
>>>
>>> df3 = pd.concat([df1,df2])
>>>
>>> df3.groupby('ID').agg({"Date": 'last','A_sum' : 'sum','B_sum' : 'sum','A_count': 'sum','B_count': 'sum','A_last': 'last','B_last': 'last'})
       Date  A_sum  B_sum  A_count  B_count  A_last  B_last
ID
abc  02/jan     50     48        4        4      14      13
xyz  02/jan    118    100        6        6      20      16

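An attempt that loops over the columns:

for i in cols:
    df3 = df2.groupby(['ID'],as_index=False).agg(i+'_Num'=(i,'sum'),i+'_denom'=(i,'count'),i+'_last'=(i,'last'))
    final = df1.append(df3).groupby('ID',as_index=False).agg({i+'_Num':'sum',i+'_denom':'sum',i+'_last': 'last'})

But it does not work: a keyword argument name cannot be an expression such as i+'_Num', so the .agg(...) call is a SyntaxError.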
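Python does not accept computed keyword names, but a dict of names can be unpacked with `**`. The following is a rough sketch of that workaround, assuming the summary columns follow df1's `<col>_sum`, `<col>_count`, `<col>_last` naming rather than the `_Num`/`_denom` names in the loop; the variables `named` and `how` are illustrative and not from the original post.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": ["abc", "xyz"],
    "Date": ["01/jan", "01/jan"],
    "A_sum": [26, 54], "A_count": [2, 3],
    "B_sum": [25, 45], "B_count": [2, 3],
    "A_last": [0, 4], "B_last": [0, 6],
})
df2 = pd.DataFrame({
    "ID": ["abc", "abc", "xyz", "xyz", "xyz"],
    "Date": ["02/jan"] * 5,
    "Time": [11, 12, 1, 2, 3],
    "A": [10, 14, 26, 18, 20],
    "B": [10, 13, 24, 15, 16],
})

cols = ["A", "B"]

# Build the named-aggregation spec as a dict: output name -> (column, function).
named = {}
for c in cols:
    named[f"{c}_sum"] = (c, "sum")
    named[f"{c}_count"] = (c, "count")
    named[f"{c}_last"] = (c, "last")

# Unpack the dict so each entry becomes a keyword argument of .agg().
df3 = df2.groupby("ID", as_index=False).agg(**named)

# Second pass: add sums and counts, keep the newest value for *_last.
how = {name: ("last" if name.endswith("_last") else "sum") for name in named}
final = (
    pd.concat([df1, df3], ignore_index=True)
    .groupby("ID", as_index=False)
    .agg(how)
)
print(final)
```

Because everything is driven by `cols`, adding a third value column would only require extending that list, provided df1 already carries the matching summary columns.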

You can concatenate the two dataframes and then use groupby:

cols = df1.columns
df1 = df1[['ID','Date','A_sum','B_sum']]
df2 = df2.drop(columns='Time')
df1.columns = df2.columns  # rename A_sum/B_sum to A/B so the two frames stack
merged_df  = pd.concat([df1,df2]).groupby(['ID']).agg({'A' : ['sum','count','last'],'B' : ['sum','count','last'],'Date': 'last'})
merged_df.columns = merged_df.columns.map('_'.join)

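Output (note that A_count and B_count here count the rows of the concatenated frame, one from df1 plus those from df2, so they come out as 3 and 4 rather than the desired 4 and 6):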

     A_sum  A_count  A_last  B_sum  B_count  B_last Date_last
ID                                                           
abc     50        3      14     48        3      13    02/jan
xyz    118        4      20    100        4      16    02/jan

Use:

#https://stackoverflow.com/a/67800033/2901002
cols = ['A','B']

df11 = df2.groupby(['ID','Date'])[cols].agg(['sum','count'])
df11.columns = df11.columns.map(lambda x: f'{x[0]}_{x[1]}')

df22 = df2.groupby(['ID','Date'])[cols].last().add_suffix('_last')

df3 = pd.concat([df11,df22],axis=1).reset_index(level=1)
print (df3)
       Date  A_sum  A_count  B_sum  B_count  A_last  B_last
ID                                                         
abc  02/jan     24        2     23        2      14      13
xyz  02/jan     64        3     55        3      20      16

Filter df1 down to only the columns that need to be summed:

df33 = df1.filter(regex='ID|_sum|count').set_index('ID')
print (df33)
     A_sum  A_count  B_sum  B_count
ID                                 
abc     26        2     25        2
xyz     54        3     45        3

Concatenate, sum, and assign the missing Date if necessary:

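# DataFrame.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df = pd.concat([df3,df33]).groupby(level=0).sum(numeric_only=True).astype(int).assign(Date = df3['Date']).set_index('Date',append=True).reset_index()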
print (df)
    ID    Date  A_sum  A_count  B_sum  B_count  A_last  B_last
0  abc  02/jan     50        4     48        4      14      13
1  xyz  02/jan    118        6    100        6      20      16
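If the same update has to be repeated for each new day of raw rows, the steps can be wrapped in a function. The sketch below is one possible way to do that using index alignment instead of concat; `roll_forward` and its parameters are hypothetical names, and it assumes every ID of interest appears in the new data (IDs present only in the previous summary are not carried over).

```python
import pandas as pd

def roll_forward(prev, new, cols=("A", "B")):
    """Hypothetical helper: fold one batch of raw rows into a running summary."""
    g = new.groupby("ID")
    # One row per ID, indexed by ID, starting with the newest Date.
    out = pd.DataFrame({"Date": g["Date"].last()})
    for c in cols:
        out[f"{c}_sum"] = g[c].sum()
        out[f"{c}_count"] = g[c].count()
        out[f"{c}_last"] = g[c].last()
    # Sums and counts accumulate; *_last is simply overwritten by the new batch.
    add_cols = [f"{c}_{k}" for c in cols for k in ("sum", "count")]
    # Align the previous totals on ID; IDs unseen before start from 0.
    carried = prev.set_index("ID")[add_cols].reindex(out.index, fill_value=0)
    out[add_cols] = out[add_cols] + carried
    return out.reset_index()

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": ["abc", "xyz"],
    "Date": ["01/jan", "01/jan"],
    "A_sum": [26, 54], "A_count": [2, 3],
    "B_sum": [25, 45], "B_count": [2, 3],
    "A_last": [0, 4], "B_last": [0, 6],
})
df2 = pd.DataFrame({
    "ID": ["abc", "abc", "xyz", "xyz", "xyz"],
    "Date": ["02/jan"] * 5,
    "Time": [11, 12, 1, 2, 3],
    "A": [10, 14, 26, 18, 20],
    "B": [10, 13, 24, 15, 16],
})

result = roll_forward(df1, df2)
print(result)
```

The returned frame has the same summary columns as df1 (in a different order), so it can be passed back in as `prev` together with the next day's rows.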