How to fix "bad input path" when setting up a simple MRJob on a single-node EC2 instance
I'm trying to run a simple word count program in Python using Hadoop and mrjob. I have installed pseudo-distributed Hadoop 2.7.3 on a single t2.micro EC2 instance. I run the program as follows:
python mr_word_count.py -r hadoop hdfs:///user/ubuntu/input/lorem.txt -o output
But it fails with the following error:
Using configs in /home/ubuntu/.mrjob.conf
Looking for hadoop binary in /home/ubuntu/hadoop/hadoop-2.7.3/bin...
Found hadoop binary: /home/ubuntu/hadoop/hadoop-2.7.3/bin/hadoop
Using Hadoop version 2.7.3
Creating temp directory /tmp/mr_word_count.ubuntu.20210403.013125.236375
uploading working dir files to hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd...
copying other local files to hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/
Running step 1 of 1...
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
session.id is deprecated. Instead,use dfs.metrics.session-id
Initializing JVM Metrics with processName=JobTracker,sessionId=
Cannot initialize JVM Metrics with processName=JobTracker,sessionId= - already initialized
Cleaning up the staging area file:/tmp/mapred/staging/ubuntu1155540475/.staging/job_local1155540475_0001
Error launching job,bad input path : File does not exist: /tmp/mapred/staging/ubuntu1155540475/.staging/job_local1155540475_0001/files/mr_word_count.py#mr_word_count.py
Streaming Command Failed!
Attempting to fetch counters from logs...
Can't fetch history log; missing job ID
No counters found
Scanning logs for probable cause of failure...
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Step 1 of 1 Failed: Command '['/home/ubuntu/hadoop/hadoop-2.7.3/bin/hadoop','jar','/home/ubuntu/hadoop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar','-files','hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd/mr_word_count.py#mr_word_count.py,hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd/setup-wrapper.sh#setup-wrapper.sh','-input','hdfs:///user/ubuntu/input/lorem.txt','-output','hdfs:///user/ubuntu/output','-mapper','/bin/sh -ex setup-wrapper.sh python3 mr_word_count.py --step-num=0 --mapper','-combiner','/bin/sh -ex setup-wrapper.sh python3 mr_word_count.py --step-num=0 --combiner','-reducer','/bin/sh -ex setup-wrapper.sh python3 mr_word_count.py --step-num=0 --reducer']' returned non-zero exit status 512.
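It looks like the runner is supposed to copy my program to /tmp/mapred/staging/ but does not, so I suspect I'm missing some configuration somewhere. The Python code exists only locally; the input file is in HDFS.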
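One detail in the log may be worth checking: the job ID in the error starts with job_local, and the staging area is a file: path. As far as I understand, job IDs of that form come from Hadoop's LocalJobRunner rather than from a cluster scheduler. The sketch below only illustrates that check on the failing log line; the helper name describe_job_id is made up for this example and is not part of mrjob or Hadoop.

```python
import re

# The failing line, copied from the log above.
LOG_LINE = (
    "Error launching job,bad input path : File does not exist: "
    "/tmp/mapred/staging/ubuntu1155540475/.staging/"
    "job_local1155540475_0001/files/mr_word_count.py#mr_word_count.py"
)


def describe_job_id(log_text):
    """Return (job_id, is_local) for the first job ID found, or None.

    Hypothetical helper: is_local is True when the ID has the
    "job_local" prefix, which suggests the local job runner was used.
    """
    match = re.search(r"job_(local)?\d+_\d+", log_text)
    if match is None:
        return None
    return match.group(0), match.group(1) is not None


if __name__ == "__main__":
    print(describe_job_id(LOG_LINE))
```

If the second element is True, the job was probably submitted with mapreduce.framework.name resolved to local on the client side, which would be consistent with the staging directory being looked up on the local filesystem instead of HDFS.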
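I've seen a bunch of questions here with almost the same error (in particular this and this), but no change to the configuration XML fixed it. If I run the job in local (-r local) or inline (-r inline) mode, it works, but it does not work under the Hadoop runner (-r hadoop).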
This is the program I'm trying to run: https://gist.github.com/k4v/5d0d1425977fe7e228e7a1e538f72d68
Hadoop configuration files:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml (I'm not using YARN, because it causes every MapReduce job to hang on the machine's 1 GB of RAM)
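Since the contents of these files are not reproduced here, the following is only a sketch of how one could read a single property out of a Hadoop-style XML config with the Python standard library. The XML string is a hypothetical example, not my actual mapred-site.xml, and get_property is a name invented for illustration. The property mapreduce.framework.name is a real Hadoop setting that accepts local, classic, or yarn.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents, for illustration only.
SAMPLE_MAPRED_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
</configuration>
"""


def get_property(xml_text, key, default=None):
    """Return the value of a <property> named `key`, or `default`."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == key:
            return prop.findtext("value", "").strip()
    return default


if __name__ == "__main__":
    print(get_property(SAMPLE_MAPRED_SITE, "mapreduce.framework.name"))
```

To use it on a real file, the text could be read with open(path).read() and passed in; when the property is absent, Hadoop falls back to its own default, which the default argument here does not try to model.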
The following processes are running:
$ jps
23283 Jps
21846 NodeManager
21545 SecondaryNameNode
21674 ResourceManager
21325 Datanode
21149 NameNode
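As a small cross-check of the process list, the sketch below parses the jps output above and reports which YARN daemons appear in it (ResourceManager and NodeManager are the YARN daemons). The helper name running_daemons is hypothetical.

```python
# jps output copied from above.
JPS_OUTPUT = """23283 Jps
21846 NodeManager
21545 SecondaryNameNode
21674 ResourceManager
21325 Datanode
21149 NameNode
"""

YARN_DAEMONS = {"ResourceManager", "NodeManager"}


def running_daemons(jps_text):
    """Return the set of process names from `jps` output lines."""
    names = set()
    for line in jps_text.splitlines():
        parts = line.split()
        # Each well-formed line is "<pid> <name>".
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


if __name__ == "__main__":
    print(sorted(running_daemons(JPS_OUTPUT) & YARN_DAEMONS))
```

Here both YARN daemons show up in the list, even though the job configuration is meant to avoid YARN, so the running daemons and the configured framework may not match.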
Please help me figure out what I'm missing. Thanks.