Dask DataFrame: how to resample partitioned time-series data loaded from multiple Parquet files
I am loading multiple Parquet files containing time-series data, but the resulting Dask DataFrame has unknown divisions, so I cannot apply time-series operations to it.
df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')
For example,
df_resampled = df.resample('1T').mean().compute()
gives the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
1 df = dd.read_parquet('/path/to/*.parquet',index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self,rule,closed,label)
2627 from .tseries.resample import Resampler
2628
-> 2629 return Resampler(self,closed=closed,label=label)
2630
2631 @derived_from(pd.DataFrame)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self,obj,**kwargs)
118 "for more information."
119 )
--> 120 raise ValueError(msg)
121 self.obj = obj
122 self._rule = pd.tseries.frequencies.to_offset(rule)
ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.
I went to the link https://docs.dask.org/en/latest/dataframe-design.html#partitions, which says:
In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).
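For context, "divisions" are, loosely speaking, the sorted index boundaries separating the partitions, which let Dask know which partition holds which time range without scanning the data. A pandas-only sketch of the idea (the partition objects and all data here are illustrative, not Dask internals):

```python
import pandas as pd

# Two "partitions": non-overlapping, internally sorted chunks of one time series.
part1 = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:30"]),
)
part2 = pd.DataFrame(
    {"value": [3.0, 4.0]},
    index=pd.to_datetime(["2020-01-01 01:00", "2020-01-01 01:30"]),
)
partitions = [part1, part2]

# Known divisions: the first index value of each partition, plus the last
# index value overall -- i.e. len(partitions) + 1 boundary values.
divisions = tuple(p.index[0] for p in partitions) + (partitions[-1].index[-1],)
print(divisions)
```

When the Parquet files carry no index statistics, Dask cannot derive these boundaries, and the divisions stay unknown.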
I then tried to follow that advice, but without success:
df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')
This step throws the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
1 df = dd.read_parquet(os.path.join(OUTPUT_data_dir,'20*.gzip'))
----> 2 df.set_index('Timestamps')
3 # df_resampled = df.resample('1T').mean().compute()
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***Failed resolving arguments***)
   3915             npartitions=npartitions,
   3916             divisions=divisions,
-> 3917             **kwargs,
   3918         )
3919
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df,index,npartitions,shuffle,compute,drop,upsample,divisions,partition_size,**kwargs)
483 if divisions is None:
484 sizes = df.map_partitions(sizeof) if repartition else []
--> 485 divisions = index2._repartition_quantiles(npartitions,upsample=upsample)
486 mins = index2.map_partitions(M.min)
487 maxes = index2.map_partitions(M.max)
~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self,key)
3755 return self[key]
3756 else:
-> 3757 raise AttributeError("'DataFrame' object has no attribute %r" % key)
3758
3759 def __dir__(self):
AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'
Can someone suggest the correct way to load multiple time-series files as a Dask DataFrame, so that pandas time-series operations can be applied to it?
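For reference, the per-partition operation Dask would ultimately run is plain pandas, and its precondition is exactly what known divisions encode: a sorted DatetimeIndex. A pandas-only sketch of the end state I am trying to reach (data made up for illustration):

```python
import pandas as pd

# Unsorted timestamps, as if concatenated from several files.
df = pd.DataFrame(
    {"value": [3.0, 1.0, 2.0]},
    index=pd.to_datetime(
        ["2020-01-01 00:02", "2020-01-01 00:00", "2020-01-01 00:01"]
    ),
)

# The pandas analogue of set_index(...): sort so the index is monotonic.
df = df.sort_index()

# Resampling to 1-minute bins ("1min" is the same rule as '1T' above)
# is now well-defined.
df_resampled = df.resample("1min").mean()
print(df_resampled["value"].tolist())  # [1.0, 2.0, 3.0]
```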