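How to read partitioned Parquet files into an in-memory DataFrame in Python when a column type is array of arrays

Context

I have partitioned Parquet files in S3. I want to read them and concatenate them into a single DataFrame so that I can query and inspect the data in memory. I have managed to do this, but one column, of type array<array<double>>, is converted to None. Other columns (str, array of int, and so on) are converted correctly. I am not sure what I am missing. I suspect that either the data is lost during the conversion, or the data is there and my way of querying it is wrong.

What I have done so far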

import s3fs
import fastparquet as fp
import pandas as pd

key = 'MyAWSKey'
secret = 'MyAWSSecret'
token = 'MyAWSToken'

s3_file_system = s3fs.S3FileSystem(key=key, secret=secret, token=token)
file_names = s3_file_system.glob(path='s3://.../*.snappy.parquet')

# <class 'fastparquet.api.ParquetFile'>
# Pass the globbed file list (file_names) and let fastparquet open each file through s3fs
fp_api_parquetfile_obj = fp.ParquetFile(file_names, open_with=s3_file_system.open)

data = fp_api_parquetfile_obj.to_pandas()

Query results

# column A type is array of array of doubles
print(pd.Series(data['A']).head(10))
# Prints 10 rows of None! [Incorrect]

# column B type is array of int
print(pd.Series(data['B']).head(10))
# Prints 10 rows of array of int values correctly

# column C type is string
print(pd.Series(data['C']).head(10))
# Prints 10 rows of str values correctly

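Note that the data (the arrays of doubles) does exist in the files, because I can query it with Athena.

Solution

I could not find any way to make fastparquet read an array-of-arrays column; instead, I used a different library (pyarrow), and it works!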

import s3fs
import pandas as pd
import pyarrow.parquet as pq

key = 'MyAWSKey'
secret = 'MyAWSSecret'
token = 'MyAWSToken'

s3_file_system = s3fs.S3FileSystem(key=key, secret=secret, token=token)
file_names = s3_file_system.glob(path='s3://.../*.snappy.parquet')

# Read each globbed file into its own DataFrame (iterate over file_names, the list defined above)
data_frames = [
    pq.ParquetDataset('s3://' + f, filesystem=s3_file_system).read_pandas().to_pandas()
    for f in file_names
]

data = pd.concat(data_frames,ignore_index=True)

# column A type is array of array of doubles
print(pd.Series(data['A']).head(10))
# Prints 10 rows of array of arrays correctly
