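I am trying to build an ETL toolkit with pandas and HDF5.
My plan was:
1. extract a table from MySQL into a DataFrame;
2. put this DataFrame into an HDFStore.
But when I got to step 2, I found that putting the DataFrame into a *.h5 file takes far too long.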
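> size of the table on the source MySQL server: 498 MB
> 52 columns
> 924,624 records
> the 'put' operation took 849.345677137 seconds
My questions are:
Is this time cost normal?
Is there any way to make it faster?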
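For scale, the figures reported above work out to roughly 0.59 MB/s, or about 1,089 rows per second. The short calculation below only restates that arithmetic; it assumes the 498 MB table size on the MySQL side is a fair proxy for the amount of data written, which may not hold for the in-memory DataFrame.

```python
# Back-of-the-envelope throughput from the numbers in the question.
# Assumption: the MySQL table size approximates the amount of data written.
size_mb = 498
rows = 924_624
seconds = 849.345677137

mb_per_s = size_mb / seconds
rows_per_s = rows / seconds

print(f"{mb_per_s:.2f} MB/s, {rows_per_s:.0f} rows/s")
```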
Update 1
Thanks, Jeff.
> My code is very simple:
    extract_store = HDFStore('extract_store.h5')
    extract_store['df_staff'] = df_staff
> When I try 'ptdump -av file.h5' I get an error, but I can still load the DataFrame object from this h5 file:
    tables.exceptions.HDF5ExtError: HDF5 error back trace
      File "../../../src/H5F.c", line 1512, in H5Fopen
        unable to open file
      File "../../../src/H5F.c", line 1307, in H5F_open
        unable to read superblock
      File "../../../src/H5Fsuper.c", line 305, in H5F_super_read
        unable to find file signature
      File "../../../src/H5Fsuper.c", line 153, in H5F_locate_signature
        unable to find a valid file signature
    End of HDF5 error back trace
    Unable to open/create file 'extract_store.h5'
> Some other information:
> pandas version: '0.10.0'
> OS: Ubuntu Server 10.04 x86_64
> CPU: 8 * Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
> MemTotal: 51634016 kB
I will update pandas to 0.10.1-dev and try again.
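To compare timings across pandas versions, it helps to measure the store write in isolation from the MySQL pull. The helper below is a minimal sketch; `timed` is a hypothetical name, not part of pandas, and the commented usage assumes the `extract_store` and `df_staff` objects from the snippet above.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical usage (requires pandas and PyTables):
#   extract_store = pd.HDFStore("extract_store.h5")
#   _, seconds = timed(extract_store.put, "df_staff", df_staff)
#   print(f"put took {seconds:.2f} s")
```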
Update 2
> I have updated pandas to '0.10.1.dev-6e2b6ea'
> But the time cost did not go down; this time it took 884.15 seconds
> The output of 'ptdump -av file.h5' is:
    / (RootGroup) ''
      /._v_attrs (AttributeSet), 4 attributes:
       [CLASS := 'GROUP', PYTABLES_FORMAT_VERSION := '2.0', TITLE := '', VERSION := '1.0']
    /df_bugs (Group) ''
      /df_bugs._v_attrs (AttributeSet), 12 attributes:
       [CLASS := 'GROUP', TITLE := '', VERSION := '1.0', axis0_variety := 'regular', axis1_variety := 'regular', block0_items_variety := 'regular', block1_items_variety := 'regular', block2_items_variety := 'regular', nblocks := 3, ndim := 2, pandas_type := 'frame', pandas_version := '0.10.1']
    /df_bugs/axis0 (Array(52,)) ''
      atom := StringAtom(itemsize=19, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/axis0._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
    /df_bugs/axis1 (Array(924624,)) ''
      atom := Int64Atom(shape=(), dflt=0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/axis1._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'integer', name := None, transposed := True]
    /df_bugs/block0_items (Array(5,)) ''
      atom := StringAtom(itemsize=12, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block0_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
    /df_bugs/block0_values (Array(924624, 5)) ''
      atom := Float64Atom(shape=(), dflt=0.0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/block0_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', transposed := True]
    /df_bugs/block1_items (Array(19,)) ''
      atom := StringAtom(itemsize=19, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block1_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
    /df_bugs/block1_values (Array(924624, 19)) ''
      atom := Int64Atom(shape=(), dflt=0)
      maindim := 0
      flavor := 'numpy'
      byteorder := 'little'
      chunkshape := None
      /df_bugs/block1_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', transposed := True]
    /df_bugs/block2_items (Array(28,)) ''
      atom := StringAtom(itemsize=18, shape=(), dflt='')
      maindim := 0
      flavor := 'numpy'
      byteorder := 'irrelevant'
      chunkshape := None
      /df_bugs/block2_items._v_attrs (AttributeSet), 7 attributes:
       [CLASS := 'ARRAY', FLAVOR := 'numpy', TITLE := '', VERSION := '2.3', kind := 'string', name := None, transposed := True]
    /df_bugs/block2_values (VLArray(1,)) ''
      atom = ObjectAtom()
      byteorder = 'irrelevant'
      nrows = 1
      flavor = 'numpy'
      /df_bugs/block2_values._v_attrs (AttributeSet), 5 attributes:
       [CLASS := 'VLARRAY', PSEUDOATOM := 'object', TITLE := '', VERSION := '1.3', transposed := True]
> I tried your code below (putting the DataFrame into the HDFStore with the 'table' parameter set to True), but got an error; it seems Python's datetime type is not supported:
    Exception: cannot find the correct atom type -> [dtype->object] object of type 'datetime.datetime' has no len()
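The exception above suggests that at least one object column holds plain `datetime.datetime` values. One possible workaround is to convert such columns to `datetime64` before storing. The sketch below assumes a recent pandas release rather than the 0.10.x used here, and both helper names are hypothetical.

```python
import datetime as dt

import pandas as pd

def datetime_object_columns(df):
    """Names of object-dtype columns whose non-null values are all datetimes."""
    names = []
    for name in df.columns:
        if df[name].dtype != object:
            continue
        values = df[name].dropna()
        if len(values) and all(isinstance(v, dt.datetime) for v in values):
            names.append(name)
    return names

def convert_datetime_columns(df):
    """Return a copy of df with those columns converted to datetime64."""
    out = df.copy()
    for name in datetime_object_columns(out):
        # None / NaN entries become NaT after conversion.
        out[name] = pd.to_datetime(out[name])
    return out
```

Columns that mix datetimes with other values are left untouched, so they would still need separate handling.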
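Update 3
Thanks, Jeff.
Sorry for the delay.
> tables.version: '2.4.0'.
> Yes, the 884 seconds is the cost of the put operation alone, not including the pull from MySQL
> One row of the DataFrame (df.ix[0]):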
    bug_id                 1
    assigned_to            185
    bug_file_loc           None
    bug_severity           critical
    bug_status             closed
    creation_ts            1998-05-06 21:27:00
    delta_ts               2012-05-09 14:41:41
    short_desc             Two cursors.
    host_op_sys            Unknown
    guest_op_sys           Unknown
    priority               P3
    rep_platform           IA32
    reporter               56
    product_id             7
    category_id            983
    component_id           12925
    resolution             fixed
    target_milestone       ws1
    qa_contact             412
    status_whiteboard
    Votes                  0
    keywords               SR
    lastdiffed             2012-05-09 14:41:41
    everconfirmed          1
    reporter_accessible    1
    cclist_accessible      1
    estimated_time         0.00
    remaining_time         0.00
    deadline               None
    alias                  None
    found_in_product_id    0
    found_in_version_id    0
    found_in_phase_id      0
    cf_type                Defect
    cf_reported_by         Development
    cf_attempted           NaN
    cf_Failed              NaN
    cf_public_summary
    cf_doc_impact          0
    cf_security            0
    cf_build               NaN
    cf_branch
    cf_change              NaN
    cf_test_id             NaN
    cf_regression          Unknown
    cf_reviewer            0
    cf_on_hold             0
    cf_public_severity     ---
    cf_i18n_impact         0
    cf_eta                 None
    cf_bug_source          ---
    cf_viss                None
    Name: 0, Length: 52
> A snapshot of the DataFrame (just typing 'df' in the IPython notebook):
    Int64Index: 924624 entries, 0 to 924623
    Data columns:
    bug_id                 924624  non-null values
    assigned_to            924624  non-null values
    bug_file_loc           427318  non-null values
    bug_severity           924624  non-null values
    bug_status             924624  non-null values
    creation_ts            924624  non-null values
    delta_ts               924624  non-null values
    short_desc             924624  non-null values
    host_op_sys            924624  non-null values
    guest_op_sys           924624  non-null values
    priority               924624  non-null values
    rep_platform           924624  non-null values
    reporter               924624  non-null values
    product_id             924624  non-null values
    category_id            924624  non-null values
    component_id           924624  non-null values
    resolution             924624  non-null values
    target_milestone       924624  non-null values
    qa_contact             924624  non-null values
    status_whiteboard      924624  non-null values
    Votes                  924624  non-null values
    keywords               924624  non-null values
    lastdiffed             924509  non-null values
    everconfirmed          924624  non-null values
    reporter_accessible    924624  non-null values
    cclist_accessible      924624  non-null values
    estimated_time         924624  non-null values
    remaining_time         924624  non-null values
    deadline               0  non-null values
    alias                  0  non-null values
    found_in_product_id    924624  non-null values
    found_in_version_id    924624  non-null values
    found_in_phase_id      924624  non-null values
    cf_type                924624  non-null values
    cf_reported_by         924624  non-null values
    cf_attempted           89622  non-null values
    cf_Failed              89587  non-null values
    cf_public_summary      510799  non-null values
    cf_doc_impact          924624  non-null values
    cf_security            924624  non-null values
    cf_build               327460  non-null values
    cf_branch              614929  non-null values
    cf_change              300612  non-null values
    cf_test_id             12610  non-null values
    cf_regression          924624  non-null values
    cf_reviewer            924624  non-null values
    cf_on_hold             924624  non-null values
    cf_public_severity     924624  non-null values
    cf_i18n_impact         924624  non-null values
    cf_eta                 3910  non-null values
    cf_bug_source          924624  non-null values
    cf_viss                725  non-null values
    dtypes: float64(5), int64(19), object(28)
> After 'convert_objects()':
    dtypes: datetime64[ns](2), float64(5), int64(19), object(26)
> Putting the converted DataFrame into the HDFStore took 749.50 s
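Since 26 columns are still `object` after the conversion, it may be worth checking what they actually contain, for example plain strings versus a mix of strings, `None` and `NaN`. The sketch below counts the Python types found in each object column; `object_column_types` is a hypothetical helper name, and the code assumes a recent pandas release.

```python
from collections import Counter

import pandas as pd

def object_column_types(df):
    """Map each object-dtype column to a Counter of the Python type names it holds."""
    report = {}
    for name in df.columns:
        if df[name].dtype == object:
            # Iterating an object Series yields the stored Python objects.
            report[name] = Counter(type(value).__name__ for value in df[name])
    return report
```

On a frame of this size the scan visits every cell of every object column, so running it on a sample of rows first may be more practical.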