
Dask Dataframe: resampling partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing time-series data, but the resulting Dask dataframe has unknown divisions, so I cannot apply time-series operations to it.

df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')

For example, df_resampled = df.resample('1T').mean().compute() raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
      1 df = dd.read_parquet('/path/to/*.parquet',index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self,rule,closed,label)
   2627         from .tseries.resample import Resampler
   2628 
-> 2629         return Resampler(self,closed=closed,label=label)
   2630 
   2631     @derived_from(pd.DataFrame)

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self,obj,**kwargs)
    118                 "for more information."
    119             )
--> 120             raise ValueError(msg)
    121         self.obj = obj
    122         self._rule = pd.tseries.frequencies.to_offset(rule)

ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.

I went to the linked page https://docs.dask.org/en/latest/dataframe-design.html#partitions, which says:

In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).

I then tried to follow that advice, without success:

df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')

This step raises the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
      1 df = dd.read_parquet(os.path.join(OUTPUT_data_dir,'20*.gzip'))
----> 2 df.set_index('Timestamps')
      3 # df_resampled = df.resample('1T').mean().compute()

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***Failed resolving arguments***)
   3915                 npartitions=npartitions,
   3916                 divisions=divisions,
-> 3917                 **kwargs,
   3918             )
   3919 

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df,index,npartitions,shuffle,compute,drop,upsample,divisions,partition_size,**kwargs)
    483     if divisions is None:
    484         sizes = df.map_partitions(sizeof) if repartition else []
--> 485         divisions = index2._repartition_quantiles(npartitions,upsample=upsample)
    486         mins = index2.map_partitions(M.min)
    487         maxes = index2.map_partitions(M.max)

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self,key)
   3755             return self[key]
   3756         else:
-> 3757             raise AttributeError("'DataFrame' object has no attribute %r" % key)
   3758 
   3759     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'

Could someone suggest the correct way to load multiple time-series files as a Dask dataframe so that pandas-style time-series operations can be applied?
