
How to set up an Athena int array column from an ORC source file?


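I have an ORC file that contains two columns of (variable-length) integer lists, which I want to access as SQL arrays. The file passes all tests with the native ORC tools. I can create the table with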

CREATE EXTERNAL TABLE atoms (
    EMOL_VERSION_ID VARCHAR(5),
    label INT,
    atype SMALLINT,
    element TINYINT,
    isotope SMALLINT,
    charge TINYINT,
    valence TINYINT,
    degree TINYINT,
    hydrogencount TINYINT,
    neighborcount TINYINT,
    xneighborcount TINYINT,
    aroringcount SMALLINT,
    ringcount SMALLINT,
    ringsizes BIGINT,
    bonds ARRAY<INT>,
    ligands ARRAY<INT>,
    molnumber SMALLINT
)
STORED AS ORC
LOCATION 's3://xathena/structures/atoms/'
TBLPROPERTIES ("orc.compress" = "ZLIB");

MSCK REPAIR TABLE atoms;

Both statements execute without any errors or warnings. But the subsequent auto-generated table preview

SELECT * FROM "structures"."atoms" limit 10;

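fails:

Your query has the following error(s):

HIVE_BAD_DATA: Field bonds's type INT in ORC is incompatible with type array defined in table schema. This query ran against the "structures" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 5aea88d3-5a98-4137-8cc7-6d8fd7a66547.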

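I have other ORC files, written by the same software but without int lists, and they work fine in Athena, so I am fairly sure the file format itself is not the problem.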

The native ORC tools show the file structure as

orc-metadata atom.orc

{ "name": "atoms.orc",
  "type": "struct<EMOL_VERSION_ID:string,label:int,atype:smallint,element:tinyint,isotope:smallint,charge:tinyint,valence:tinyint,degree:tinyint,hydrogencount:tinyint,neighborcount:tinyint,xneighborcount:tinyint,aroringcount:smallint,ringcount:smallint,ringsizes:bigint,bonds:array<int>,ligands:array<int>,molnumber:smallint>",
  "rows": 13742,
  "stripe count": 2,
  "format": "0.12", "writer version": "original",
  "compression": "zlib", "compression block": 262144,
  "file length": 63618,
  "content": 62890, "stripe stats": 302, "footer": 409, "postscript": 16,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 10000,
      "offset": 3, "length": 44482,
      "index": 451, "data": 43860, "footer": 171
    },
    { "stripe": 1, "rows": 3742,
      "offset": 44485, "length": 18405,
      "index": 447, "data": 17784, "footer": 174
    }
  ]
}

orc-contents atom.orc | head -3

{"EMOL_VERSION_ID": "9986","label": 1,"atype": 1,"element": 6,"isotope": 0,"charge": 0,"valence": 4,"degree": 3,"hydrogencount": 0,"neighborcount": 3,"xneighborcount": 2,"aroringcount": 0,"ringcount": 0,"ringsizes": 0,"bonds": [1,2,3],"ligands": [2,9,10],"molnumber": 1}
{"EMOL_VERSION_ID": "9986","label": 2,"valence": 2,"degree": 2,"neighborcount": 2,"xneighborcount": 0,4],"ligands": [1,"label": 3,"bonds": [4,5],"molnumber": 1}

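Both the array definition and the underlying ORC file data definitely use INT as the element type. I do not understand this message or how to deal with it. Can anyone help?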
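
The claim that the table definition and the ORC schema agree can be checked mechanically rather than by eye. The following is a minimal sketch in Python, assuming the `struct<...>` string printed by `orc-metadata` and the column list from the `CREATE EXTERNAL TABLE` statement are pasted in as plain strings. The helper names (`split_top_level`, `normalize_type`, `parse_orc_struct`, `parse_ddl_columns`, `diff_schemas`) are hypothetical and not part of any ORC or Athena tooling, and treating `VARCHAR(n)` as equivalent to ORC `string` is a simplifying assumption. It compares columns by position and only covers the one file whose metadata is pasted in; it says nothing about other files under the same S3 location.

```python
import re

# Schema string as printed by orc-metadata (copied from the question).
ORC_TYPE = (
    "struct<EMOL_VERSION_ID:string,label:int,atype:smallint,element:tinyint,"
    "isotope:smallint,charge:tinyint,valence:tinyint,degree:tinyint,"
    "hydrogencount:tinyint,neighborcount:tinyint,xneighborcount:tinyint,"
    "aroringcount:smallint,ringcount:smallint,ringsizes:bigint,"
    "bonds:array<int>,ligands:array<int>,molnumber:smallint>"
)

# Column list from the CREATE EXTERNAL TABLE statement (copied from the question).
DDL_COLUMNS = (
    "EMOL_VERSION_ID VARCHAR(5),label INT,atype SMALLINT,element tinyint,"
    "isotope SMALLINT,charge tinyint,valence tinyint,degree tinyint,"
    "hydrogencount tinyint,neighborcount tinyint,xneighborcount tinyint,"
    "aroringcount SMALLINT,ringcount SMALLINT,ringsizes BIGINT,"
    "bonds ARRAY <INT>,ligands ARRAY <INT>,molnumber SMALLINT"
)


def split_top_level(text):
    """Split on commas that are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return [p for p in parts if p]


def normalize_type(type_name):
    """Lowercase, drop whitespace, and map DDL spellings onto ORC spellings."""
    t = re.sub(r"\s+", "", type_name).lower()
    # Assumption: VARCHAR(n)/CHAR(n) are treated as equivalent to ORC string.
    t = re.sub(r"(varchar|char)\(\d+\)", "string", t)
    return t.replace("integer", "int")


def parse_orc_struct(type_str):
    """'struct<a:int,b:array<int>>' -> [('a', 'int'), ('b', 'array<int>')]"""
    s = type_str.strip()
    if not (s.startswith("struct<") and s.endswith(">")):
        raise ValueError("expected a top-level struct<...> type")
    fields = []
    for part in split_top_level(s[len("struct<"):-1]):
        name, _, typ = part.partition(":")
        fields.append((name.strip().lower(), normalize_type(typ)))
    return fields


def parse_ddl_columns(ddl_columns):
    """'a INT,b ARRAY <INT>' -> [('a', 'int'), ('b', 'array<int>')]"""
    columns = []
    for part in split_top_level(ddl_columns):
        name, typ = part.split(None, 1)
        columns.append((name.lower(), normalize_type(typ)))
    return columns


def diff_schemas(orc_type, ddl_columns):
    """Return (position, orc_field, ddl_column) for every position that differs."""
    orc = parse_orc_struct(orc_type)
    ddl = parse_ddl_columns(ddl_columns)
    mismatches = []
    for pos in range(max(len(orc), len(ddl))):
        o = orc[pos] if pos < len(orc) else None
        d = ddl[pos] if pos < len(ddl) else None
        if o != d:
            mismatches.append((pos, o, d))
    return mismatches


if __name__ == "__main__":
    # An empty list means the two schemas agree position by position.
    print(diff_schemas(ORC_TYPE, DDL_COLUMNS))
```

For the schema and DDL shown in the question this reports no differences, which is consistent with the statement that this particular file declares `bonds` and `ligands` as `array<int>`.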
