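Cumulative sum with two conditions?

How do I compute a cumulative sum with two conditions?

I've just started using Python and Pandas to lighten my workload. My df looks like this: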

import pandas as pd

df = pd.DataFrame({
    'Div': [2, 2, 2, 2, 3, 3, 3, 3],
    'date': ['01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'] * 2,
    'income': [1000, 1500, 1000, 500, 700, 2000, 1000, 6000],
    'total': [0] * 8,
})

I need to compute the cumulative total for each row, as long as the Div column does not change. So far I've managed to do it like this:

df2=df
for i in df.index:
    for j in df2.index:
        if (df.loc[i,'Div']==df2.loc[j,'Div'] and df2.loc[j,'date']<=df.loc[i,'date']):
            df.loc[i,'total']+=df2.loc[j,'income']

The result is as follows:

Div	date	income	total
2	01/09/2020	1000	3500
2	01/09/2020	1500	3500
2	01/09/2020	1000	3500
2	02/09/2020	500	4000
3	01/09/2020	700	3700
3	01/09/2020	2000	3700
3	01/09/2020	1000	3700
3	02/09/2020	6000	9700

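It works, but my original file has 13,000 rows and it takes more than 2 hours to finish. I've been reading online, and everywhere says that iteration should be avoided when using Pandas, but I can't find a solution that fits my problem.

Is there a better way?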

Answers

Do you want to compute the total as a cumulative sum with Div held fixed? Use groupby:

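df = pd.DataFrame({
    'Div': [2, 2, 2, 2, 3, 3, 3, 3],
    'date': ['01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'] * 2,
    'income': [1000, 1500, 1000, 500, 700, 2000, 1000, 6000],
    'total': [0] * 8,
})

# Running total of income, row by row, within each (Div, date) group.
# Selecting 'income' first keeps the result a single Series that fits one column.
# Note: this restarts at every new date, so it differs from the per-date totals in the question.
df['total'] = df.groupby(['Div', 'date'])['income'].cumsum()
print(df)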

# Step 1: sum income for each (Div, date) group
df1 = df.groupby(['Div', 'date'])['income'].sum().reset_index()
#    Div        date  income
# 0    2  01/09/2020    3500
# 1    2  02/09/2020     500
# 2    3  01/09/2020    3700
# 3    3  02/09/2020    6000

# Step 2: cumulative sum of those group totals within each Div
df1['total'] = df1.groupby('Div')['income'].cumsum()
#    Div        date  income  total
# 0    2  01/09/2020    3500   3500
# 1    2  02/09/2020     500   4000
# 2    3  01/09/2020    3700   3700
# 3    3  02/09/2020    6000   9700

# Step 3: drop the placeholder total column, then merge the computed totals
# back onto the original rows
del df['total']
result = pd.merge(df, df1[['Div', 'date', 'total']], on=['Div', 'date'], how='left')
#    Div        date  income  total
# 0    2  01/09/2020    1000   3500
# 1    2  01/09/2020    1500   3500
# 2    2  01/09/2020    1000   3500
# 3    2  02/09/2020     500   4000
# 4    3  01/09/2020     700   3700
# 5    3  01/09/2020    2000   3700
# 6    3  01/09/2020    1000   3700
# 7    3  02/09/2020    6000   9700
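One caveat with the steps above: the dates are plain strings, so groupby orders them lexicographically, and that only matches calendar order while all dates fall in the same month. A minimal sketch of the same three steps with parsed dates follows. It assumes the dates are day/month/year (the question does not say), and the three-row frame is made-up data chosen to show the difference.

```python
import pandas as pd

# Hypothetical data: as strings, '01/10/2020' sorts before '02/09/2020',
# although 1 October comes after 2 September.
df = pd.DataFrame({
    'Div': [2, 2, 2],
    'date': ['02/09/2020', '01/10/2020', '15/09/2020'],
    'income': [500, 300, 200],
})

# Assumption: day/month/year format.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Same idea as above: sum per (Div, date), then cumulative sum per Div.
daily = df.groupby(['Div', 'date'], as_index=False)['income'].sum()
daily['total'] = daily.groupby('Div')['income'].cumsum()

# A left merge keeps the original row order of df.
result = df.merge(daily[['Div', 'date', 'total']], on=['Div', 'date'], how='left')
print(result)
```

With parsed dates the running total follows calendar order regardless of how the rows are arranged in the input.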

Group by Div and date and use sum, then group by Div and use cumsum, then assign the result to the total column.

result = (
df.set_index(['Div','date'])
  .assign(
      total=df.groupby(['Div','date'])['income'].sum().groupby("Div").cumsum()
  ).reset_index()
)
result

Output:

    Div date        income  total
0   2   01/09/2020  1000    3500
1   2   01/09/2020  1500    3500
2   2   01/09/2020  1000    3500
3   2   02/09/2020  500     4000
4   3   01/09/2020  700     3700
5   3   01/09/2020  2000    3700
6   3   01/09/2020  1000    3700
7   3   02/09/2020  6000    9700
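To confirm that a vectorized version reproduces the totals from the double loop in the question, a small brute-force comparison can help. This is only a sketch: the names group_sums, running, fast and slow are illustrative, and the data is the eight-row frame implied by the result table in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Div': [2, 2, 2, 2, 3, 3, 3, 3],
    'date': ['01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'] * 2,
    'income': [1000, 1500, 1000, 500, 700, 2000, 1000, 6000],
})

# Vectorized: sum per (Div, date), then cumulative sum within each Div.
group_sums = df.groupby(['Div', 'date'])['income'].sum()
running = group_sums.groupby(level='Div').cumsum().rename('total')
fast = df.join(running, on=['Div', 'date'])

# Brute force: for each row, add up every income with the same Div and an
# earlier or equal date (string comparison, as in the question's loop).
slow = [
    int(df.loc[(df['Div'] == d) & (df['date'] <= t), 'income'].sum())
    for d, t in zip(df['Div'], df['date'])
]

assert fast['total'].tolist() == slow
print(fast)
```

The brute-force part is still quadratic, so it is best run on a small sample of the real file rather than on all 13,000 rows.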

I would do it like this:

precalc = df.groupby(["Div","date"]).sum().reset_index().set_index("date")
df["total"] = df.apply(lambda row: precalc[precalc["Div"]==row["Div"]][:row["date"]]["income"].sum(),axis=1)

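precalc is essentially the sum for every ("Div", "date") pair. apply is then run on each row of the table to create total, which is the sum of all precalculated rows with a matching Div, over the date range from the start up to the row's date.

My thinking is that doing the precalculation up front avoids repeating that work while building the cumulative sums. It pays off when many rows share the same Div and date.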

A good way to do this is to use GroupBy.transform:
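One possible transform-based variant, as a minimal sketch and not necessarily the exact code this answer had in mind. It assumes the dates are day/month/year; the helper column _key and the names ordered, running and total are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Div': [2, 2, 2, 2, 3, 3, 3, 3],
    'date': ['01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'] * 2,
    'income': [1000, 1500, 1000, 500, 700, 2000, 1000, 6000],
})

# Assumption: day/month/year. Parsing makes the sort chronological.
key = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Order rows by Div, then by date.
ordered = df.assign(_key=key).sort_values(['Div', '_key'])

# Row-by-row running total within each Div.
running = ordered.groupby('Div')['income'].cumsum()

# The last running value in each (Div, date) group is the total up to and
# including that date; transform broadcasts it to every row of the group.
total = running.groupby([ordered['Div'], ordered['_key']]).transform('last')

# Assignment aligns on the index, so df keeps its original row order.
df['total'] = total
print(df)
```

Because transform returns a result with the same index as its input, no merge step is needed to get the totals back onto the original rows.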
