"bad input path" error when setting up a simple MRJob on a single-node EC2 instance

I am trying to run a simple word-count program in Python using Hadoop and mrjob. I have installed pseudo-distributed Hadoop 2.7.3 on a single t2.micro EC2 instance. I run the program as follows:

python mr_word_count.py -r hadoop hdfs:///user/ubuntu/input/lorem.txt  -o output

But it fails with the following error:

Using configs in /home/ubuntu/.mrjob.conf
Looking for hadoop binary in /home/ubuntu/hadoop/hadoop-2.7.3/bin...
Found hadoop binary: /home/ubuntu/hadoop/hadoop-2.7.3/bin/hadoop
Using Hadoop version 2.7.3
Creating temp directory /tmp/mr_word_count.ubuntu.20210403.013125.236375
uploading working dir files to hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd...
copying other local files to hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/
Running step 1 of 1...
  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  session.id is deprecated. Instead,use dfs.metrics.session-id
  Initializing JVM Metrics with processName=JobTracker,sessionId=
  Cannot initialize JVM Metrics with processName=JobTracker,sessionId= - already initialized
  Cleaning up the staging area file:/tmp/mapred/staging/ubuntu1155540475/.staging/job_local1155540475_0001
  Error launching job,bad input path : File does not exist: /tmp/mapred/staging/ubuntu1155540475/.staging/job_local1155540475_0001/files/mr_word_count.py#mr_word_count.py
  Streaming Command Failed!
Attempting to fetch counters from logs...
Can't fetch history log; missing job ID
No counters found
Scanning logs for probable cause of failure...
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Step 1 of 1 Failed: Command '['/home/ubuntu/hadoop/hadoop-2.7.3/bin/hadoop','jar','/home/ubuntu/hadoop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar','-files','hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd/mr_word_count.py#mr_word_count.py,hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/ubuntu/tmp/mrjob/mr_word_count.ubuntu.20210403.013125.236375/files/wd/setup-wrapper.sh#setup-wrapper.sh','-input','hdfs:///user/ubuntu/input/lorem.txt','-output','hdfs:///user/ubuntu/output','-mapper','/bin/sh -ex setup-wrapper.sh python3 mr_word_count.py --step-num=0 --mapper','-combiner','/bin/sh -ex setup-wrapper.sh python3 mr_word_count.py --step-num=0 --combiner','-reducer','/bin/sh -ex setup-wrapper.sh python3 mr_word_count.py --step-num=0 --reducer']' returned non-zero exit status 512.

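It looks like the runner is supposed to copy my program to /tmp/mapred/staging/ but does not, so I suspect I am missing a configuration setting somewhere. The Python code exists only on the local filesystem; the input file is in HDFS.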
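
To make the staging detail easier to see, here is a minimal sketch that pulls the filesystem scheme and the job ID out of the two relevant log lines. The helper name staging_info is hypothetical, and the log text is copied from the output above. As far as I understand, a job ID starting with job_local is what Hadoop's local job runner produces, and the file: scheme means the staging area is on the local filesystem rather than in HDFS.

```python
import re

# Two lines copied verbatim from the log above.
LOG = """\
  Cleaning up the staging area file:/tmp/mapred/staging/ubuntu1155540475/.staging/job_local1155540475_0001
  Error launching job,bad input path : File does not exist: /tmp/mapred/staging/ubuntu1155540475/.staging/job_local1155540475_0001/files/mr_word_count.py#mr_word_count.py
"""


def staging_info(log_text):
    """Return (filesystem scheme, job ID) of the staging area, or None.

    Hypothetical helper for illustration; it only inspects the
    "Cleaning up the staging area ..." line of a Hadoop streaming log.
    """
    match = re.search(r"staging area (\w+):(\S+)/(job_\w+)", log_text)
    if match is None:
        return None
    scheme, _path, job_id = match.groups()
    return scheme, job_id


print(staging_info(LOG))  # ('file', 'job_local1155540475_0001')
```

So the job is being staged on the local filesystem while the -files arguments point at hdfs:/// paths, which may or may not be related to the failure.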

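I have seen a number of questions here with nearly identical errors (in particular this and this), but no change to the configuration XML files fixed the error. The job works if I run it in local (-r local) or inline (-r inline) mode, but not under the Hadoop runner (-r hadoop).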

Here is the program I am trying to run: https://gist.github.com/k4v/5d0d1425977fe7e228e7a1e538f72d68

Hadoop configuration files:

core-site.xml
hdfs-site.xml
mapred-site.xml (I am not using YARN, because it causes any MapReduce job to hang on the machine's 1 GB of RAM)
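
Since the file contents are not reproduced inline, here is a small sketch of how the effective settings can be dumped for comparison. The function name read_hadoop_properties is hypothetical, and the sample XML below is an illustrative stand-in, not the contents of my actual mapred-site.xml. The property mapreduce.framework.name is the standard Hadoop setting that selects the local, classic, or yarn framework.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is stored as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Hypothetical sample for illustration only (an assumption, not my real file).
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
</configuration>
"""

props = read_hadoop_properties(SAMPLE)
print(props.get("mapreduce.framework.name", "<not set>"))  # local
```

To check a real file, pass its text instead of SAMPLE. On this machine the files would presumably be under /home/ubuntu/hadoop/hadoop-2.7.3/etc/hadoop/, assuming the standard Hadoop 2.x layout.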

The following processes are running:

$ jps
23283 Jps
21846 NodeManager
21545 SecondaryNameNode
21674 ResourceManager
21325 Datanode
21149 NameNode
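
One thing that stands out in this list is that the YARN daemons are running even though I am not submitting jobs through YARN. A minimal sketch for picking them out of the jps output (running_daemons is a hypothetical helper; the text is copied from the listing above):

```python
# jps output copied verbatim from the listing above.
JPS_OUTPUT = """\
23283 Jps
21846 NodeManager
21545 SecondaryNameNode
21674 ResourceManager
21325 Datanode
21149 NameNode
"""


def running_daemons(jps_text):
    """Return the set of process names reported by jps, excluding jps itself."""
    names = set()
    for line in jps_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1] != "Jps":
            names.add(parts[1])
    return names


daemons = running_daemons(JPS_OUTPUT)
yarn_daemons = sorted(daemons & {"ResourceManager", "NodeManager"})
print(yarn_daemons)  # ['NodeManager', 'ResourceManager']
```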

Please help me figure out what I am missing. Thanks.