python – Pandas: selecting DF rows based on another DF

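I have two DataFrames (quite long, each with hundreds or thousands of rows). One of them, called df1, contains a time series at 10-minute intervals. For example: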

               date          value
2016-11-24 00:00:00    1759.199951
2016-11-24 00:10:00     992.400024
2016-11-24 00:20:00    1404.800049
2016-11-24 00:30:00      45.799999
2016-11-24 00:40:00      24.299999
2016-11-24 00:50:00     159.899994
2016-11-24 01:00:00      82.499999
2016-11-24 01:10:00      37.400003
2016-11-24 01:20:00     159.899994
....

And the other one, df2, contains datetime intervals:

              start_date             end_date
0    2016-11-23 23:55:32  2016-11-24 00:14:03
1    2016-11-24 01:03:18  2016-11-24 01:07:12
2    2016-11-24 01:11:32  2016-11-24 02:00:00 
...

I need to select all the rows in df1 that "fall" into one of the intervals in df2.

With these examples, the resulting DataFrame should be:

               date          value
2016-11-24 00:00:00    1759.199951   # Fits in row 0 of df2
2016-11-24 00:10:00     992.400024   # Fits in row 0 of df2
2016-11-24 01:00:00      82.499999   # Fits in row 1 of df2
2016-11-24 01:10:00     37.400003   # Fits in row 2 of df2
2016-11-24 01:20:00     159.899994   # Fits in row 2 of df2
....

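Solution:

Using np.searchsorted:

Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.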

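# Ensure df2 is sorted (skip if it's already known to be), and reset the index
# so that the positions returned by searchsorted match the index labels used by reindex.
df2 = df2.sort_values(by=['start_date', 'end_date']).reset_index(drop=True)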

# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
    ]

# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)

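This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through for that scenario, but it may still work.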
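To make the lookup step more concrete, here is a small sketch that runs the same idea on the rows from the question and prints the intermediate positions. The names pos_start, pos_end, keep and selected are mine, not part of the answer's code, and I use pd.Timedelta for the 9m 59s span.

```python
import numpy as np
import pandas as pd

# First rows of df1 and df2 as given in the question.
df1 = pd.DataFrame({
    'date': pd.date_range('2016-11-24', periods=9, freq='10min'),
    'value': [1759.199951, 992.400024, 1404.800049, 45.799999, 24.299999,
              159.899994, 82.499999, 37.400003, 159.899994],
})
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18',
                                  '2016-11-24 01:11:32']),
    'end_date': pd.to_datetime(['2016-11-24 00:14:03', '2016-11-24 01:07:12',
                                '2016-11-24 02:00:00']),
})

starts = df2['start_date'].values
date_end = df1['date'] + pd.Timedelta(minutes=9, seconds=59)

# Position of the last df2 interval that starts at or before each timestamp.
# A value of -1 would mean that no interval has started yet.
pos_start = np.searchsorted(starts, df1['date'].values, side='right') - 1
pos_end = np.searchsorted(starts, date_end.values, side='right') - 1
print(pos_start.tolist())  # [0, 0, 0, 0, 0, 0, 0, 1, 2]
print(pos_end.tolist())    # [0, 0, 0, 0, 0, 0, 1, 2, 2]

# reindex maps a missing label such as -1 to an all-NaT row,
# and any comparison against NaT is False.
s1 = df2.reindex(pos_start)
s2 = df2.reindex(pos_end)

keep = (
    (df1['date'].values <= s1['end_date'].values)  # span start lies inside an interval
    | (date_end.values <= s2['end_date'].values)   # span end lies inside an interval
    | (pos_start != pos_end)                       # an interval starts within the span
)
selected = df1[keep]
print(selected.index.tolist())  # [0, 1, 6, 7, 8]
```

The third condition is what catches the 01:00:00 row: its span starts before row 1 of df2 begins and ends after it finishes, so neither endpoint lies inside an interval, but the two lookups land on different positions.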

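Using an interval tree

Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2 and querying it against the intervals in df1 to find the ones that overlap.

The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.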

from intervaltree import IntervalTree

# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

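For performance reasons, I converted the dates to their integer equivalents. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that would slow things down.

Also note that intervals in the intervaltree package include the start point but not the end point. That's why I add [0, 1] when creating the tree: I'm padding the end point by one nanosecond to make sure the real end point is actually included. It's also why I can add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.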
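The half-open convention can be illustrated without the package. This is a plain-Python sketch, not intervaltree's actual code: overlaps_half_open is a hypothetical helper, and small integers stand in for nanosecond timestamps.

```python
def overlaps_half_open(begin, end, q_begin, q_end):
    """Return True if [begin, end) and [q_begin, q_end) share at least one point."""
    return begin < q_end and q_begin < end


# A df2-style interval whose end point (200) is meant to be included.
start, stop = 100, 200

# A df1-style span that starts exactly at that end point.
print(overlaps_half_open(start, stop, 200, 210))      # False: 200 itself is excluded
print(overlaps_half_open(start, stop + 1, 200, 210))  # True: end padded by one unit

# A span that ends exactly where the interval starts does not overlap,
# which is why the query span can use the full step as its exclusive end.
print(overlaps_half_open(start, stop + 1, 90, 100))   # False
```

Padding only the end of the stored interval makes it behave like a closed interval, while the query spans remain half-open.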

The resulting output for both methods:

                 date        value
0 2016-11-24 00:00:00  1759.199951
1 2016-11-24 00:10:00   992.400024
6 2016-11-24 01:00:00    82.499999
7 2016-11-24 01:10:00    37.400003
8 2016-11-24 01:20:00   159.899994
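On small inputs, one way to sanity-check this output is a brute-force comparison of every df1 span against every df2 interval. The sketch below uses NumPy broadcasting and treats both sides as closed intervals. It is not one of the methods timed here, brute_force_overlap is a hypothetical name, and it needs len(df1) * len(df2) booleans of memory, so it is only meant for checking.

```python
import numpy as np
import pandas as pd


def brute_force_overlap(df1, df2, span=pd.Timedelta(minutes=9, seconds=59)):
    # Column vectors for the df1 spans, row vectors for the df2 intervals.
    d_start = df1['date'].values[:, None]
    d_end = (df1['date'] + span).values[:, None]
    i_start = df2['start_date'].values[None, :]
    i_end = df2['end_date'].values[None, :]

    # Two closed intervals overlap when each one starts no later than the other ends.
    hits = (d_start <= i_end) & (i_start <= d_end)
    return df1[hits.any(axis=1)]


# First rows of df1 and df2 as given in the question.
df1 = pd.DataFrame({
    'date': pd.date_range('2016-11-24', periods=9, freq='10min'),
    'value': [1759.199951, 992.400024, 1404.800049, 45.799999, 24.299999,
              159.899994, 82.499999, 37.400003, 159.899994],
})
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18',
                                  '2016-11-24 01:11:32']),
    'end_date': pd.to_datetime(['2016-11-24 00:14:03', '2016-11-24 01:07:12',
                                '2016-11-24 02:00:00']),
})

print(brute_force_overlap(df1, df2).index.tolist())  # [0, 1, 6, 7, 8]
```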

Timings

Using the following setup to produce larger sample data:

# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})

# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})

# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
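As a side note, the df2 part of this setup builds its shifts one element at a time with apply and pd.DateOffset. A possible vectorized variant using pd.to_timedelta is sketched below. The seeded generator rng is my assumption, added only for reproducibility, so the random draws will not match the sample shown here, and this is not the setup that produced the timings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, an assumption for reproducibility
n2 = 500

# Regular grid of interval starts, 18 hours 22 minutes apart.
base = pd.date_range('2016-11-24', freq=pd.Timedelta(hours=18, minutes=22), periods=n2)

# Cumulative shift of 0-29 seconds per step for the start dates.
shift_start = pd.to_timedelta(rng.integers(30, size=n2).cumsum(), unit='s')

# Interval length: 0-29 seconds plus 5-44 minutes.
shift_end = (pd.to_timedelta(rng.integers(30, size=n2), unit='s')
             + pd.to_timedelta(rng.integers(5, 45, size=n2), unit='min'))

start = base + shift_start
df2 = pd.DataFrame({'start_date': start, 'end_date': start + shift_end})
```

Because every interval is shorter than 45 minutes and the starts are more than 18 hours apart, the generated intervals never overlap, which matches the case the searchsorted approach was written for.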

This yields the following for df1 and df2:

df1
                  date     value
0     2016-11-24 00:00:00  0.444939
1     2016-11-24 00:10:00  0.407554
2     2016-11-24 00:20:00  0.460148
3     2016-11-24 00:30:00  0.465239
4     2016-11-24 00:40:00  0.462691
...
54995 2017-12-10 21:50:00  0.754123
54996 2017-12-10 22:00:00  0.401820
54997 2017-12-10 22:10:00  0.146284
54998 2017-12-10 22:20:00  0.394759
54999 2017-12-10 22:30:00  0.907233

df2
              start_date            end_date
0   2016-11-24 00:00:19 2016-11-24 00:41:24
1   2016-11-24 18:22:44 2016-11-24 18:36:44
2   2016-11-25 12:44:44 2016-11-25 13:03:13
3   2016-11-26 07:07:05 2016-11-26 07:49:29
4   2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53

And using the following functions for timing:

def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
        ]

    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)

def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')

    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)

    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values

    return df1[mask]

def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values

    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])

    # DF FILTERING (Series.between is inclusive on both ends by default)
    span = pd.Timedelta(minutes=9, seconds=59)
    mask = (df3['start_date'].between(df3['date'], df3['date'] + span)
            | df3['date'].between(df3['start_date'], df3['end_date']))
    return df3[mask].set_index('date')[['value', 'row']]

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)

    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
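The ptrj function above relies on Series.asof, which, for each requested timestamp, returns the value at the last index entry at or before it. Here is a minimal sketch of that lookup on the start dates from the question. The names where and looked_up are mine.

```python
import pandas as pd

# Row numbers of three intervals, keyed by their (sorted) start times.
sdate = pd.Series(
    [0, 1, 2],
    index=pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18',
                          '2016-11-24 01:11:32']),
)

where = pd.to_datetime(['2016-11-23 23:00:00',   # before any interval starts
                        '2016-11-24 00:10:00',   # after the first start
                        '2016-11-24 01:10:00'])  # after the second start

# asof returns NaN when nothing precedes the timestamp; fill it with -1.
looked_up = sdate.asof(where).fillna(-1)
print(looked_up.tolist())  # [-1.0, 0.0, 1.0]
```

In ptrj, l is the number of the last interval that ended strictly before the span starts, and r is the number of the last interval that started by the time the span ends. With sorted, non-overlapping intervals, l < r means that interval r has started but not yet ended within the span, i.e. there is an overlap.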

I get the following timings:

%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops, best of 3: 9.55 ms per loop

%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops, best of 3: 13.5 ms per loop

%timeit ptrj(df1.copy(), df2.copy())
100 loops, best of 3: 18.5 ms per loop

%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop, best of 3: 4.02 s per loop

%timeit parfait(df1.copy(), df2.copy())
1 loop, best of 3: 8.96 s per loop
