How to fix hdfs distcp being unable to copy from HDFS to S3
We have a Snowball set up on an internal staging node, with the endpoint http://10.91.16.213:8080. Everything works fine; I can even list files on this Snowball with the s3 CLI:
aws s3 ls my-bucket/data/ --endpoint-url=http://10.91.16.213:8080
Now I am trying to copy data from HDFS to the S3 Snowball with the hadoop distcp command. First, I tested hadoop distcp by copying some files to a real S3 test bucket in my AWS account:
hadoop distcp \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=AKIAUPWDYDZTSGWUWJWN \
-Dfs.s3a.secret.key=<my-secret> \
hdfs://path/to/data/ \
s3a://test-bucket-anum/
The above command works fine and launches the copy job on the Hadoop cluster. Now, to copy to my internal Snowball, all I should have to do is change the endpoint. Here is my attempt:
hadoop distcp \
-Dfs.s3a.endpoint=http://10.91.16.213:8080 \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=AKIACEMGMYDQNJXGQ2DEOBXG42SQCFR2ZJFTDED3HX3KLVTLOIN6AH3FSDHUF \
-Dfs.s3a.secret.key=<snowball-secret> \
hdfs://path/to/data/ \
s3a://my-bucket/
The above command fails with the following error:
20/09/02 19:20:22 INFO s3a.S3AFileSystem: Caught an AmazonClientException, which means the client encountered a serious internal problem while trying to communicate with S3, such as not being able to access the network.
20/09/02 19:20:22 INFO s3a.S3AFileSystem: Error Message: {}com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:217)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 12 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field:
true
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 18 more
20/09/02 19:20:22 ERROR tools.DistCp: Invalid arguments:
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:217)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 12 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field:
true
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 18 more
Invalid arguments: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and append new
data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied to <= n
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs are
saved
-m <arg> Max number of concurrent maps to use for copy
-mapredSslConf <arg> Configuration for ssl config file, to use with
hftps://
-overwrite Choose to overwrite target files unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication, block-size, user, group, permission, checksum-type, ACL, XATTR, timestamps). If -p is
specified with no <arg>, then preserves
replication, block size, checksum type and timestamps. raw.* xattrs are
preserved when both the source and destination
paths are in the /.reserved/raw hierarchy (HDFS
only). raw.* xattr preservation is independent of
the -p flag. Refer to the DistCp documentation for
more details.
-sizelimit <arg> (Deprecated!) Limit number of files copied to <= n
bytes
-skipcrccheck Whether to skip CRC checks between source and
target paths.
-strategy <arg> copy strategy to use. Default is dividing work
based on file sizes
-tmp <arg> Intermediate work path to be used for atomic
commit
-update Update target, copying only missing files or
directories
Here are other Hadoop configuration options I have tried, with no luck (the combined invocation is sketched below):
-Dfs.s3a.connection.ssl.enabled=false (since my endpoint is plain http)
-Dfs.s3a.region=eu-west-1
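For reference, this is roughly how those extra flags are combined with the earlier command (a sketch only, assuming they are simply added to the same invocation; the placeholders stand in for the Snowball credentials):
hadoop distcp \
-Dfs.s3a.endpoint=http://10.91.16.213:8080 \
-Dfs.s3a.connection.ssl.enabled=false \
-Dfs.s3a.region=eu-west-1 \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=<snowball-access-key> \
-Dfs.s3a.secret.key=<snowball-secret> \
hdfs://path/to/data/ \
s3a://my-bucket/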
Am I missing something?
Update:
Since the error message also contains Invalid arguments:, I thought some invalid character might be getting passed in the args, so I tried putting these options in /etc/hadoop/conf/core-site.xml instead, as follows:
<property>
<name>fs.s3a.endpoint</name>
<value>http://10.91.16.213:8080</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>AKIACEMGMYDQNJXGQ2DEOBXG42SQCFR2ZJFTDED3HX3KLVTLOIN6AH3FSDHUF</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value><snowball-secret></value>
</property>
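With those properties in core-site.xml, the distcp call itself becomes minimal (a sketch, assuming the same source and target paths as before):
hadoop distcp \
hdfs://path/to/data/ \
s3a://my-bucket/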
But I get the same error message :(
Update 2:
After reading this, it looks like an S3 XML parsing problem when performing ListObjects. The AWS Java client has the option .withEncodingType("url"), but I could not find anything similar for hadoop distcp.
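For reference, this is the kind of call I mean with the plain AWS Java SDK (v1): a minimal sketch where the client setup, path-style access, and bucket name are my own assumptions, not anything distcp exposes.
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ListSnowballBucket {
    public static void main(String[] args) {
        // Point the SDK at the Snowball endpoint instead of the public S3 endpoint.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(
                        new AwsClientBuilder.EndpointConfiguration("http://10.91.16.213:8080", "eu-west-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("<snowball-access-key>", "<snowball-secret>")))
                .withPathStyleAccessEnabled(true) // assumption: path-style addressing for an IP endpoint
                .build();

        // withEncodingType("url") asks the service to URL-encode object keys in the
        // ListObjects response; this is the option I cannot find an equivalent for in s3a/distcp.
        ListObjectsRequest request = new ListObjectsRequest()
                .withBucketName("my-bucket")
                .withEncodingType("url");

        for (S3ObjectSummary summary : s3.listObjects(request).getObjectSummaries()) {
            System.out.println(summary.getKey());
        }
    }
}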