Custom intervals for pandas.resample or groupby

Suppose I start with this DataFrame:

import pandas as pd

# Eight weekly rows; the values match the table printed below.
d = {'price': [10,12,8,12,14,18,10,20],'volume': [50,60,40,100,50,100,40,50]}
df = pd.DataFrame(d)
df['a_date'] = pd.date_range('01/01/2018',periods=8,freq='W')

df
    price   volume  a_date
0   10      50      2018-01-07
1   12      60      2018-01-14
2   8       40      2018-01-21
3   12      100     2018-01-28
4   14      50      2018-02-04
5   18      100     2018-02-11
6   10      40      2018-02-18
7   20      50      2018-02-25

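Now I would like to resample/group so that the data is aggregated over intervals of roughly 10 days, but with predefined start and end dates that fall on the 10th, the 20th, and the last day of each month, for example: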

2018-01-01 to 2018-01-10
2018-01-11 to 2018-01-20
2018-01-21 to 2018-01-31
2018-02-01 to 2018-02-10
2018-02-11 to 2018-02-20
2018-02-21 to 2018-02-28

Summing over each interval, the result would be:

             price  volume  
a_date
2018-01-10   10     50      
2018-01-20   12     60      
2018-01-31   20     140     
2018-02-10   14     50      
2018-02-20   28     140     
2018-02-28   20     50

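The closest I can get is df.resample('10D',on='a_date').sum(), but obviously I need something more customized as the interval. I would be happy to just pass an array of intervals, but I don't think that is possible.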

As an experiment, I have tried:

td = pd.to_datetime('2018-01-10') - pd.to_datetime('2018-01-01')
df.resample(td,on='a_date').sum()

But pandas.Timedelta does not retain information about specific dates.
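To illustrate the point: a Timedelta is only a duration, so the one built from 2018-01-01 and 2018-01-10 cannot be told apart from any other 9-day span. A minimal sketch (the names nine_days and other are illustrative only):

```python
import pandas as pd

# The Timedelta from the experiment above.
td = pd.to_datetime("2018-01-10") - pd.to_datetime("2018-01-01")

# The same duration built without reference to any calendar date.
nine_days = pd.Timedelta(days=9)

# A 9-day span taken from a different part of the calendar.
other = pd.to_datetime("2018-02-19") - pd.to_datetime("2018-02-10")

# Both comparisons are True: the start and end dates are not stored.
print(td == nine_days, td == other)
```

Passing td to resample is therefore expected to give fixed 9-day windows, not windows anchored to the 10th, the 20th, or the end of the month.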

Edit:

A different DataFrame to test the first day of the month:

import numpy as np

d = {'price': np.arange(20)+1,'volume': np.arange(20)+5}
df = pd.DataFrame(d)
df['a_date'] = pd.date_range('01/01/2018',periods=20,freq='D')

Applying the accepted answer gives the following (the first day is not taken into account):

      a_date  price  volume
0 2018-01-10     54      90
1 2018-01-20    155     195

Compare with the first interval, 2018-01-01 to 2018-01-10:

df.iloc[:10].sum()

price     55
volume    95
dtype: int64

Solution

Try:

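import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Build bin edges month by month: the 1st, 10th, 20th and last day of each month.
bins = []
end = df["a_date"].max()
current = df["a_date"].min()
current = pd.Timestamp(year=current.year,month=current.month,day=1)
while True:
    bins.append(current)
    bins.append(current + pd.Timedelta(days=9))
    bins.append(current + pd.Timedelta(days=19))
    bins.append(current + MonthEnd())
    if bins[-1] > end:
        break
    current = bins[-1] + pd.Timedelta(days=1)

# Sum only the numeric columns (recent pandas versions refuse to sum the
# datetime column). observed=False keeps empty bins, so the number of rows
# matches the number of categories used on the next line.
x = df.groupby(pd.cut(df["a_date"],bins),observed=False)[["price","volume"]].sum().reset_index()
# Label each bin by its right edge.
x["a_date"] = x["a_date"].cat.categories.right
# Drop the bins whose sums are both zero (the empty ones).
print(x[~(x.price.eq(0) & x.volume.eq(0))])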

Prints:

      a_date  price  volume
0 2018-01-10     10      50
1 2018-01-20     12      60
2 2018-01-31     20     140
4 2018-02-10     14      50
5 2018-02-20     28     140
6 2018-02-28     20      50
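The first-day discrepancy raised in the question's edit is consistent with how pd.cut builds intervals by default: each bin is open on the left and closed on the right, so a date equal to the very first edge falls outside every bin. A small sketch of that behaviour, with made-up dates and edge lists chosen only for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-10"]))

# Edges starting on the 1st: the only bin is (2018-01-01, 2018-01-10].
edges_from_first = [pd.Timestamp("2018-01-01"), pd.Timestamp("2018-01-10")]
# Edges starting one day earlier: the only bin is (2017-12-31, 2018-01-10].
edges_from_prev = [pd.Timestamp("2017-12-31"), pd.Timestamp("2018-01-10")]

# True marks a date that was not assigned to any bin.
missed = pd.cut(dates, edges_from_first).isna().tolist()
kept = pd.cut(dates, edges_from_prev).isna().tolist()

print(missed)  # [True, False, False]: 2018-01-01 is outside the bin
print(kept)    # [False, False, False]
```

Moving the first edge back to the last day of the previous month is one way to bring the 1st into the first bin.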

Edit: adjusted bins:

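import pandas as pd
from pandas.tseries.offsets import MonthEnd

end = df["a_date"].max()
current = df["a_date"].min()
# Start from the last day of the previous month, so that the right-closed
# first bin (previous month end, 10th] includes the 1st of the month.
bins = [
    pd.Timestamp(year=current.year,month=current.month,day=1) - MonthEnd(),]
current = bins[-1]
while True:
    bins.append(bins[-1] + pd.Timedelta(days=10))
    bins.append(bins[-1] + pd.Timedelta(days=10))
    bins.append(current + MonthEnd())
    if bins[-1] > end:
        break
    current = bins[-1]

# Sum only the numeric columns (recent pandas versions refuse to sum the
# datetime column). observed=False keeps empty bins, so the number of rows
# matches the number of categories used on the next line.
x = df.groupby(pd.cut(df["a_date"],bins),observed=False)[["price","volume"]].sum().reset_index()
# Label each bin by its right edge.
x["a_date"] = x["a_date"].cat.categories.right
# Drop the bins whose sums are both zero (the empty ones).
print(x[~(x.price.eq(0) & x.volume.eq(0))])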

Prints:

      a_date  price  volume
0 2018-01-10     55      95
1 2018-01-20    155     195
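As an alternative to building bin edges in a loop, the bin label can be computed directly from each date: days 1 to 10 map to the 10th, days 11 to 20 to the 20th, and everything later to the last day of the month. This is a different technique from the answer above, offered only as a sketch; bin_end is a hypothetical helper name, and the data is the first DataFrame from the question.

```python
import numpy as np
import pandas as pd


def bin_end(dates: pd.Series) -> pd.Series:
    """Hypothetical helper: map each date to the 10th, the 20th, or the
    last day of its month, whichever closes the bin the date falls in."""
    day = dates.dt.day
    # Day-of-month on which each date's bin ends.
    end_day = pd.Series(
        np.where(day <= 10, 10, np.where(day <= 20, 20, dates.dt.days_in_month)),
        index=dates.index,
    )
    # Midnight on the first day of each date's month.
    month_start = dates.dt.normalize() - pd.to_timedelta(day - 1, unit="D")
    return (month_start + pd.to_timedelta(end_day - 1, unit="D")).rename(dates.name)


df = pd.DataFrame({
    "price": [10, 12, 8, 12, 14, 18, 10, 20],
    "volume": [50, 60, 40, 100, 50, 100, 40, 50],
    "a_date": pd.date_range("01/01/2018", periods=8, freq="W"),
})

# Group by the computed bin end and sum the numeric columns.
out = df.groupby(bin_end(df["a_date"]))[["price", "volume"]].sum()
print(out)
```

Because the label is derived from the date itself, the 1st of the month lands in the first bin without any edge adjustment, and bins with no data simply do not appear in the result.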
