
Dask Dataframe: resampling partitioned data loaded from multiple parquet files

I am loading multiple parquet files containing time-series data, but the resulting Dask dataframe has unknown divisions, so I cannot apply time-series operations to it.

df = dd.read_parquet('/path/to/*.parquet', index='Timestamps')

For example, df_resampled = df.resample('1T').mean().compute() raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-8e6f7f4340fd> in <module>
      1 df = dd.read_parquet('/path/to/*.parquet',index='Timestamps')
----> 2 df_resampled = df.resample('1T').mean().compute()

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in resample(self,rule,closed,label)
   2627         from .tseries.resample import Resampler
   2628 
-> 2629         return Resampler(self,closed=closed,label=label)
   2630 
   2631     @derived_from(pd.DataFrame)

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/tseries/resample.py in __init__(self,obj,**kwargs)
    118                 "for more information."
    119             )
--> 120             raise ValueError(msg)
    121         self.obj = obj
    122         self._rule = pd.tseries.frequencies.to_offset(rule)

ValueError: Can only resample dataframes with known divisions
See https://docs.dask.org/en/latest/dataframe-design.html#partitions
for more information.

I went to the linked page https://docs.dask.org/en/latest/dataframe-design.html#partitions, which says:

In these cases (when divisions are unknown), any operation that requires a cleanly partitioned DataFrame with known divisions will have to perform a sort. This can generally be achieved by calling df.set_index(...).

I then tried to follow that advice, without success:

df = dd.read_parquet('/path/to/*.parquet')
df = df.set_index('Timestamps')

This step raises the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-468e9af0c4d6> in <module>
      1 df = dd.read_parquet(os.path.join(OUTPUT_data_dir,'20*.gzip'))
----> 2 df.set_index('Timestamps')
      3 # df_resampled = df.resample('1T').mean().compute()

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in set_index(***Failed resolving arguments***)
   3915                 npartitions=npartitions,
   3916                 divisions=divisions,
-> 3917                 **kwargs,
   3918             )
   3919 

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/shuffle.py in set_index(df,index,npartitions,shuffle,compute,drop,upsample,divisions,partition_size,**kwargs)
    483     if divisions is None:
    484         sizes = df.map_partitions(sizeof) if repartition else []
--> 485         divisions = index2._repartition_quantiles(npartitions,upsample=upsample)
    486         mins = index2.map_partitions(M.min)
    487         maxes = index2.map_partitions(M.max)

~/.conda/envs/suf/lib/python3.7/site-packages/dask/dataframe/core.py in __getattr__(self,key)
   3755             return self[key]
   3756         else:
-> 3757             raise AttributeError("'DataFrame' object has no attribute %r" % key)
   3758 
   3759     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute '_repartition_quantiles'

Could someone suggest the correct way to load multiple time-series files as a Dask dataframe so that pandas-style time-series operations can be applied?
