
Multiprocessing Pool map over multiple DataFrames raises a TypeError

How do I fix the TypeError that a multiprocessing Pool map raises over multiple DataFrames?

I have a list of files to import into Pandas DataFrames; each file is at least 100 MB.

from glob import glob
from os import path

# current path of working directory for jupyter notebook and CSV files in Google Colab
file_dir = '/content/drive/MyDrive/New York Bike Share'

# collect the CSV file names in the directory and sort them by name
file_names = sorted(glob(path.join(file_dir, '*-citibike-tripdata.csv')))
file_names

['/content/drive/MyDrive/New York Bike Share/201901-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201902-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201903-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201904-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201905-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201906-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201907-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201908-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201909-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201910-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201911-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201912-citibike-tripdata.csv']

I tried to break each file into manageable pieces with the chunksize, usecols, and related parameters of the read_csv method:

import numpy as np
import pandas as pd
from functools import reduce

# positions of the columns to keep from each CSV
cols = [0, 1, 4, 5, 6, 8, 9, 10, 12, 13, 14]

col_names = ['duration', 'time_start', 'station_name_start', 'station_latitude_start',
             'station_longitude_start', 'station_name_end', 'station_latitude',
             'station_longitude_end', 'user_type', 'birth_year', 'gender']

# downcast numeric columns and use categories to reduce memory
col_type = {
    'duration': np.int32,
    'station_latitude_start': np.float32,
    'station_longitude_start': np.float32,
    'station_latitude': np.float32,
    'station_longitude_end': np.float32,
    'user_type': 'category',
    'birth_year': 'object',
    'gender': 'category'
}

def create_df(file):
    # with chunksize set, read_csv returns an iterator of DataFrames,
    # one per 100,000 rows, instead of a single DataFrame
    t = pd.read_csv(file, chunksize=100_000, usecols=cols, names=col_names,
                    dtype=col_type, parse_dates=['time_start'], header=0)
    return t

def merge_df(ls):
    # fold a list of DataFrames into a single DataFrame
    f = reduce(lambda a, b: pd.concat([a, b], ignore_index=True), ls)
    return f
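Note that with chunksize set, pd.read_csv returns a TextFileReader (an iterator of DataFrames) rather than a DataFrame, so create_df yields one 100,000-row frame at a time. A minimal sketch, using the helpers above, of how the two fit together on a single file:

# materialize every chunk of the first file, then fold them into one frame
chunks = list(create_df(file_names[0]))
one_file_df = merge_df(chunks)  # same as pd.concat(chunks, ignore_index=True)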

Combining all of the files would produce a 2 GB+ DataFrame, so I experimented with just 3 CSV files:

df_list = []

# collect every 100,000-row chunk from the first three files
for f in file_names[0:3]:
    for chunk in create_df(f):
        df_list.append(chunk)

df_list[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration                 100000 non-null  int32         
 1   time_start               100000 non-null  datetime64[ns]
 2   station_name_start       100000 non-null  object        
 3   station_latitude_start   100000 non-null  float32       
 4   station_longitude_start  100000 non-null  float32       
 5   station_name_end         100000 non-null  object        
 6   station_latitude         100000 non-null  float32       
 7   station_longitude_end    100000 non-null  float32       
 8   user_type                100000 non-null  category      
 9   birth_year               100000 non-null  object        
 10  gender                   100000 non-null  category      
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(3)
memory usage: 5.2+ MB

merge_df(df_list).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3238991 entries, 0 to 3238990
Data columns (total 11 columns):
 #   Column                   Dtype         
---  ------                   -----         
 0   duration                 int32         
 1   time_start               datetime64[ns]
 2   station_name_start       object        
 3   station_latitude_start   float32       
 4   station_longitude_start  float32       
 5   station_name_end         object        
 6   station_latitude         float32       
 7   station_longitude_end    float32       
 8   user_type                category      
 9   birth_year               object        
 10  gender                   category      
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(3)
memory usage: 166.8+ MB

I tried to speed this process up with a multiprocessing Pool, hoping for the same result, but ran into a TypeError:

from multiprocessing import Pool
pool = Pool(8)

pool.map(merge_df, df_list)

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
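The cause appears to be how pool.map distributes the work: it calls merge_df once per DataFrame in df_list, so reduce iterates over a single DataFrame, and iterating a DataFrame yields its column labels (strings). pd.concat then receives two strings on its first step, which matches the error above. A minimal reproduction:

import pandas as pd
from functools import reduce

df = pd.DataFrame({'a': [1], 'b': [2]})
print(list(df))  # ['a', 'b'] -- iterating a DataFrame yields column labels

# reduce feeds pd.concat two strings on its first step and raises:
reduce(lambda a, b: pd.concat([a, b], ignore_index=True), df)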

Any suggestions about this error would be appreciated.
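Update: a minimal sketch of a workaround, assuming the goal is to parallelize the per-file reads; load_file is an illustrative helper, not part of the code above. Each worker reads and collapses one CSV, and the single merge happens in the parent process:

from multiprocessing import Pool

import pandas as pd

def load_file(file):
    # illustrative helper: read one CSV in a worker process and
    # collapse its chunks into a single DataFrame
    return pd.concat(create_df(file), ignore_index=True)

if __name__ == '__main__':
    with Pool(8) as pool:
        # one DataFrame comes back per file
        frames = pool.map(load_file, file_names[0:3])
    # merge everything once in the parent process
    df = pd.concat(frames, ignore_index=True)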
