How to use the Pandas groupby method to group hours with 00:00 as the center of the first bin

Grouping data by hour of day with the Pandas groupby method is straightforward:

import pandas as pd
import numpy as np

# Create a sample dataset: one random value per hour over 48 hours
size = 48
# Lowercase 'h' is the hourly alias; uppercase 'H' is deprecated since pandas 2.2
df = pd.DataFrame(np.random.rand(size),index=pd.date_range('2021-01-01',periods=size,freq='h'))

# Group the data by hour of day and find the mean
df.groupby(df.index.hour).mean()

Sometimes the hours need to be grouped into bins, which can be done with the pandas.cut method. The following splits the hours into 00:00-05:59, 06:00-11:59, 12:00-17:59 and 18:00-23:59:

# Group by bins
bins = [0,6,12,18,24]
df['time_bin'] = pd.cut(df.index.hour,bins,right=False)
df.groupby('time_bin').mean()

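Often, however, the hours need to be binned so that 00:00 falls at the center of the first bin, i.e. 21:00-02:59, 03:00-08:59, 09:00-14:59 and 15:00-20:59. Passing those edges to pandas.cut does not work, because the first bin wraps around midnight: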

# Use 00:00 as center of first bin
bins = [21,3,9,15,21]
df['time_bin'] = pd.cut(df.index.hour,bins,right=False)

# ValueError: bins must increase monotonically.

How can I group by hour so that 00:00 sits at the center of the first bin?

Solution

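Use the offset parameter of resample / pd.Grouper. As a proof of concept, I'll create a simple DataFrame with one-second resolution and add a column holding the index values, so that after resampling we can see the minimum and maximum time that fall into each bin.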

import pandas as pd

# Create a sample dataset: one value per second over 26 hours
size = 26*60*60
df = pd.DataFrame(range(size),index=pd.date_range('2020-12-31 11:00:00',periods=size,freq='s'))
df['time'] = df.index

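Resample into 2-hour bins with a 1-hour offset. The default for resample/pd.Grouper is:

origin='start_day': the origin is midnight of the first day of the time series

so the offset shifts the bin edges by one hour, from [0, 2), [2, 4), ... to [23, 1), [1, 3), ...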

res = df.resample('2h',offset='1h')['time'].agg(['min','max'])

#                                    min                 max
#2020-12-31 11:00:00 2020-12-31 11:00:00 2020-12-31 12:59:59
#2020-12-31 13:00:00 2020-12-31 13:00:00 2020-12-31 14:59:59
#2020-12-31 15:00:00 2020-12-31 15:00:00 2020-12-31 16:59:59
#2020-12-31 17:00:00 2020-12-31 17:00:00 2020-12-31 18:59:59
#...
#2021-01-01 11:00:00 2021-01-01 11:00:00 2021-01-01 12:59:59

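The bins (i.e. the index) are labeled with their left edge; you can adjust the labels after the fact by adding an offset to the index of the resampled result, for example to label each bin by its center.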

from pandas.tseries.frequencies import to_offset

res.index = res.index + to_offset('1h')

#                                    min                 max
#2020-12-31 12:00:00 2020-12-31 11:00:00 2020-12-31 12:59:59
#2020-12-31 14:00:00 2020-12-31 13:00:00 2020-12-31 14:59:59
#2020-12-31 16:00:00 2020-12-31 15:00:00 2020-12-31 16:59:59
#2020-12-31 18:00:00 2020-12-31 17:00:00 2020-12-31 18:59:59
#....
#2021-01-01 12:00:00 2021-01-01 11:00:00 2021-01-01 12:59:59
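
Applied to the bins from the question, the same idea might look like the sketch below. It assumes 6-hour bins and a column named value (both chosen here for illustration), and uses offset='21h' so that the bin edges fall at 21:00, 03:00, 09:00 and 15:00. Note that resample bins chronologically, so each day gets its own set of bins rather than being pooled by time of day.

```python
import numpy as np
import pandas as pd

# Illustrative data: one value per hour over 48 hours, starting at midnight
size = 48
df = pd.DataFrame(
    {"value": np.arange(size, dtype=float)},
    index=pd.date_range("2021-01-01", periods=size, freq="h"),
)

# 6-hour bins anchored 21 hours after midnight of the first day,
# which puts the edges at 21:00, 03:00, 09:00 and 15:00
binned = df.resample("6h", offset="21h")["value"].agg(["mean", "count"])

# Labels are the left edges; the first bin starts at 21:00 on the previous day
print(binned)
```

The first and last bins are only partially filled here because the sample data starts and ends at midnight, halfway through a bin.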

I had to shift some of the hours so that they increase monotonically across the bins I need. The code below maps hour 23 to -1, hour 22 to -2 and hour 21 to -3.

# Create column of monotonic hours for the desired bins
# Take a copy so the array is writable and independent of the index
hours = df.index.hour.to_numpy(copy=True)
hours[df.index.hour>=21] -= 24
df['hours'] = hours

Now the first bin can be specified as -03:00 to 02:59 (that is, 21:00-02:59), which puts 00:00 at the center of that bin.

bins = [-3,3,9,15,21]
df['time_bin'] = pd.cut(df['hours'],bins,right=False)
df.groupby('time_bin').mean()
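
The same wrap-around can also be expressed with modular arithmetic instead of pd.cut. The following is a minimal sketch, not part of the original answer: it assumes the bin width divides 24 evenly, and the names width, shift, bin_id, labels and the value column are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative data: one value per hour over 48 hours
size = 48
df = pd.DataFrame(
    {"value": np.arange(size, dtype=float)},
    index=pd.date_range("2021-01-01", periods=size, freq="h"),
)

width = 6           # bin width in hours (assumed to divide 24 evenly)
shift = width // 2  # half a bin, so that 00:00 sits at the center of bin 0

# Shift the hours forward by half a bin, wrap around midnight, then integer-divide
bin_id = (((df.index.hour + shift) % 24) // width).to_numpy()

# Readable labels: bin 0 starts at 21:00, bin 1 at 03:00, and so on
starts = [(k * width - shift) % 24 for k in range(24 // width)]
labels = [f"{s:02d}:00-{(s + width - 1) % 24:02d}:59" for s in starts]

df["time_bin"] = pd.Categorical.from_codes(bin_id, categories=labels, ordered=True)
result = df.groupby("time_bin", observed=True)["value"].mean()
print(result)
```

Unlike the pd.cut version, this does not need a helper column of negative hours, but it only works for equal-width bins.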

Side note for xarray users: the same general approach also works with the xarray groupby_bins method.

# Where `ds` is an xarray.Dataset with a `time` coordinate.
HOURS = ds.time.dt.hour
HOURS[HOURS>=21] -= 24
bins = [-3,3,9,15,21]
ds.groupby_bins(HOURS,bins,right=False).mean()