python – pandas – join on the closest datetime

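I have two DataFrames, left_df and right_df, each with a datetime column. I want to join them so that every row in left_df is paired with the row in right_df whose datetime is closest to it, out of all rows in right_df. I have no information about which comes first in time, the rows of left_df or the rows of right_df.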

An example is given below:

left_df = 
              left_dt           left_flag
0  2014-08-23 07:57:03.827516   True
1  2014-08-23 09:27:12.831126  False
2  2014-08-23 11:55:27.551029   True
3  2014-08-23 16:11:33.511049   True


right_df =
    right_dt                   right_flag
0   2014-08-23 07:12:52.80587    True
1   2014-08-23 15:12:34.815087   True




desired output_df =

              left_dt           left_flag        right_dt               right_flag
0  2014-08-23 07:57:03.827516   True        2014-08-23 07:12:52.80587      True
1  2014-08-23 09:27:12.831126  False        2014-08-23 07:12:52.80587      True
2  2014-08-23 11:55:27.551029   True        2014-08-23 15:12:34.815087     True
3  2014-08-23 16:11:33.511049   True        2014-08-23 15:12:34.815087     True
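
To pin down what "closest" means in the example above, here is a brute-force sketch (an illustration, not part of the original question). It compares every left timestamp with every right timestamp and keeps the position of the smallest absolute difference. The helper name `closest_right_index` is hypothetical, and the approach costs O(n_left × n_right) memory, so it is only meant as a reference for small inputs.

```python
import numpy as np
import pandas as pd


def closest_right_index(left_times, right_times):
    """For each left timestamp, return the position of the nearest right timestamp."""
    left = pd.to_datetime(left_times).values[:, None]    # shape (n_left, 1)
    right = pd.to_datetime(right_times).values[None, :]  # shape (1, n_right)
    # Broadcasting yields an (n_left, n_right) matrix of time differences.
    return np.abs(left - right).argmin(axis=1)


left_times = ['2014-08-23 07:57:03.827516',
              '2014-08-23 09:27:12.831126',
              '2014-08-23 11:55:27.551029',
              '2014-08-23 16:11:33.511049']
right_times = ['2014-08-23 07:12:52.80587',
               '2014-08-23 15:12:34.815087']

positions = closest_right_index(left_times, right_times)
print(positions)  # [0 0 1 1]
```

The positions match the desired output: the first two left rows pair with right row 0, the last two with right row 1.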

Solution:

I'm not sure this works in every case, but I think it is one possible solution.

import pandas as pd

# Test data
left_df = pd.DataFrame({'left_dt': ['2014-08-23 07:57:03.827516',
  '2014-08-23 09:27:12.831126',
  '2014-08-23 11:55:27.551029',
  '2014-08-23 16:11:33.511049'],
 'left_flag': [True, False, True, True]})
left_df['left_dt'] = pd.to_datetime(left_df['left_dt'])


right_df = pd.DataFrame(
{'right_dt': ['2014-08-23 07:12:52.80587', '2014-08-23 15:12:34.815087'],
 'right_flag': [True, True]})
right_df['right_dt'] = pd.to_datetime(right_df['right_dt'])


# Set the datetime as the index of each DataFrame (keeping it as a column too)
left_df.set_index('left_dt', drop=False, inplace=True)
right_df.set_index('right_dt', drop=False, inplace=True)

# Outer-join on the index, sort by time, and forward-fill the gaps
output_df = left_df.join(right_df, how='outer').sort_index()
output_df = output_df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
# Drop rows that have no left value (right rows that precede every left row)
output_df.dropna(subset=['left_dt'], inplace=True)
# Compute the time difference to decide which duplicated row to drop (the one with the greatest diff)
output_df['diff'] = (output_df['left_dt'] - output_df['right_dt']).abs()
output_df.sort_values(by='diff', inplace=True)  # DataFrame.sort() was removed from pandas
output_df.drop_duplicates(subset=['left_dt'], inplace=True)
# Restore chronological order and reset the index
output_df.sort_index(inplace=True)
output_df = output_df.reset_index(drop=True)
# Drop the helper column
output_df.drop('diff', axis=1, inplace=True)
output_df

                     left_dt left_flag                   right_dt right_flag
0 2014-08-23 07:57:03.827516      True 2014-08-23 07:12:52.805870       True
1 2014-08-23 09:27:12.831126     False 2014-08-23 07:12:52.805870       True
2 2014-08-23 11:55:27.551029      True 2014-08-23 15:12:34.815087       True
3 2014-08-23 16:11:33.511049      True 2014-08-23 15:12:34.815087       True
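
The forward-fill approach above only offers the next right row as a candidate to the last left row before it, so it can miss the nearest match when several left rows fall between two right rows. As an alternative (not the original answer's method), recent pandas versions provide `pd.merge_asof` with `direction='nearest'`, which considers both the previous and the next right row for every left row. A minimal sketch, assuming a pandas version that supports the `direction` parameter; the name `nearest_df` is only illustrative:

```python
import pandas as pd

left_df = pd.DataFrame({
    'left_dt': pd.to_datetime(['2014-08-23 07:57:03.827516',
                               '2014-08-23 09:27:12.831126',
                               '2014-08-23 11:55:27.551029',
                               '2014-08-23 16:11:33.511049']),
    'left_flag': [True, False, True, True],
})
right_df = pd.DataFrame({
    'right_dt': pd.to_datetime(['2014-08-23 07:12:52.80587',
                                '2014-08-23 15:12:34.815087']),
    'right_flag': [True, True],
})

# merge_asof requires both key columns to be sorted in ascending order.
nearest_df = pd.merge_asof(
    left_df.sort_values('left_dt'),
    right_df.sort_values('right_dt'),
    left_on='left_dt',
    right_on='right_dt',
    direction='nearest',  # pick the closest key, whether earlier or later
)
print(nearest_df)
```

On the test data this pairs the first two left rows with the 07:12:52 right row and the last two with the 15:12:34 right row, the same result as shown above.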
