如何解决大熊猫将不均匀的每小时数据重新采样到1D或24h箱中
我有每周的每小时FX数据,需要从星期一到星期四12:00 pm和星期五的21:00重新采样到“ 1D”或“ 24hr”箱中,每周总计5天:
Date rate
2020-01-02 00:00:00 0.673355
2020-01-02 01:00:00 0.67311
2020-01-02 02:00:00 0.672925
2020-01-02 03:00:00 0.67224
2020-01-02 04:00:00 0.67198
2020-01-02 05:00:00 0.67223
2020-01-02 06:00:00 0.671895
2020-01-02 07:00:00 0.672175
2020-01-02 08:00:00 0.672085
2020-01-02 09:00:00 0.67087
2020-01-02 10:00:00 0.6705800000000001
2020-01-02 11:00:00 0.66884
2020-01-02 12:00:00 0.66946
2020-01-02 13:00:00 0.6701600000000001
2020-01-02 14:00:00 0.67056
2020-01-02 15:00:00 0.67124
2020-01-02 16:00:00 0.6691699999999999
2020-01-02 17:00:00 0.66883
2020-01-02 18:00:00 0.66892
2020-01-02 19:00:00 0.669345
2020-01-02 20:00:00 0.66959
2020-01-02 21:00:00 0.670175
2020-01-02 22:00:00 0.6696300000000001
2020-01-02 23:00:00 0.6698350000000001
2020-01-03 00:00:00 0.66957
因此,一周中各天的小时数是不均匀的,即“星期一” =周一00:00:00到周一12:00:00,“星期二”(以及周三,周四)=即13星期一:00:00到星期二12:00:00,星期五= 13:00:00到21:00:00
在尝试找到解决方案时,我发现现在不推荐使用基数,并且偏移量/原点方法无法按预期运行,这可能是由于每天的行数不均匀所致:
df.rate.resample('24h',offset=12).ohlc()
我花了数小时试图找到解决方法
如何简单地将每个12:00:00时间戳之间的所有数据行装到ohlc()列中?
所需的输出看起来像这样:
Out[69]:
open high low close
2020-01-02 00:00:00.0000000 0.673355 0.673355 0.673355 0.673355
2020-01-03 00:00:00.0000000 0.673110 0.673110 0.668830 0.669570
2020-01-04 00:00:00.0000000 0.668280 0.668280 0.664950 0.666395
2020-01-05 00:00:00.0000000 0.666425 0.666425 0.666425 0.666425
解决方法
使用原点和偏移量作为参数,这就是您要查找的内容
df.resample('24h',origin='start_day',offset='13h').ohlc()
以您的示例为例:
open high low close
datetime
2020-01-01 13:00:00 0.673355 0.673355 0.66884 0.66946
2020-01-02 13:00:00 0.670160 0.671240 0.66883 0.66957
,
由于周期长度不相等,因此IMO必须自己制作映射轮。准确地说,星期一的1.5天长度使得freq='D'
无法一次正确进行映射。
手工编写的代码还可以防止在定义明确的时间段之外进行记录。
数据
使用稍有不同的时间戳来演示代码的正确性。这些日子是星期一。到星期五。
import pandas as pd
import numpy as np
from datetime import datetime
import io
from pandas import Timestamp,Timedelta
df = pd.read_csv(io.StringIO("""
rate
Date
2020-01-06 00:00:00 0.673355
2020-01-06 23:00:00 0.673110
2020-01-07 00:00:00 0.672925
2020-01-07 12:00:00 0.672240
2020-01-07 13:00:00 0.671980
2020-01-07 23:00:00 0.672230
2020-01-08 00:00:00 0.671895
2020-01-08 12:00:00 0.672175
2020-01-08 23:00:00 0.672085
2020-01-09 00:00:00 0.670870
2020-01-09 12:00:00 0.670580
2020-01-09 23:00:00 0.668840
2020-01-10 00:00:00 0.669460
2020-01-10 12:00:00 0.670160
2020-01-10 21:00:00 0.670560
2020-01-10 22:00:00 0.671240
2020-01-10 23:00:00 0.669170
"""),sep=r"\s{2,}",engine="python")
df.set_index(pd.to_datetime(df.index),inplace=True)
代码
def find_day(ts: Timestamp):
"""Find the trading day with irregular length"""
wd = ts.isoweekday()
if wd == 1:
return ts.date()
elif wd in (2,3,4):
return ts.date() - Timedelta("1D") if ts.hour <= 12 else ts.date()
elif wd == 5:
if ts.hour <= 12:
return ts.date() - Timedelta("1D")
elif 13 <= ts.hour <= 21:
return ts.date()
# out of range or nulls
return None
# map the timestamps,and set as new index
df.set_index(pd.DatetimeIndex(df.index.map(find_day)),inplace=True)
# drop invalid values and collect ohlc
ans = df["rate"][df.index.notnull()].resample("D").ohlc()
结果
print(ans)
open high low close
Date
2020-01-06 0.673355 0.673355 0.672240 0.672240
2020-01-07 0.671980 0.672230 0.671895 0.672175
2020-01-08 0.672085 0.672085 0.670580 0.670580
2020-01-09 0.668840 0.670160 0.668840 0.670160
2020-01-10 0.670560 0.670560 0.670560 0.670560
,
我最终结合使用了grouby和星期几的日期时间来确定我的具体解决方案
# get idxs of time to rebal (12:00:00)-------------------------------------
df['idx'] = range(len(df)) # get row index
days = [] # identify each row by day of week
for i in range(len(df.index)):
days.append(df.index[i].date().weekday())
df['day'] = days
dtChgIdx = [] # stores "12:00:00" rows
justDates = df.index.date.tolist() # gets just dates
res = [] # removes duplicate dates
[res.append(x) for x in justDates if x not in res]
justDates = res
grouped_dates = df.groupby(df.index.date) # group entire df by dates
for i in range(len(grouped_dates)):
tempDf = grouped_dates.get_group(justDates[i]) # look at each grouped dates
if tempDf['day'][0] == 6:
continue # skip Sundays
times = [] # gets just the time portion of index
for y in range(len(tempDf.index)):
times.append(str(tempDf.index[y])[-8:])
tempDf['time'] = times # add time column to df
tempDf['dayCls'] = np.where(tempDf['time'] == '12:00:00',1,0) # idx "12:00:00" row
dtChgIdx.append(tempDf.loc[tempDf['dayCls'] == 1,'idx'][0]) # idx value
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。