How to fix the "checkpoint directory is not set" problem when running the Connected Components example
This is the Connected Components example from graphframes:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends() # Get example graph
result = g.connectedComponents()
result.select("id","component").orderBy("component").show()
In the documentation, they say:
NOTE: With GraphFrames 0.3.0 and later releases, the default Connected Components algorithm requires setting a Spark checkpoint directory. Users can revert to the old algorithm using connectedComponents.setAlgorithm("graphx").
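In the Python API, this fallback appears to be exposed as the algorithm keyword argument of connectedComponents() rather than a setAlgorithm() call; a minimal sketch, assuming the graphframes Python API's algorithm keyword (the docs quote the Scala setter):

# Revert to the older GraphX-based algorithm, which does not need
# a checkpoint directory.
result = g.connectedComponents(algorithm="graphx")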
Here is my complete code, connected.py, with the setCheckpointDir call:
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext.getOrCreate()
sc.addPyFile("/home/username/.ivy2/jars/graphframes_graphframes-0.8.1-spark3.0-s_2.12.jar")
from graphframes.examples import Graphs

sqlContext = SQLContext(sc)  # needed by Graphs() below
sc.setCheckpointDir("graphframes_cps")
g = Graphs(sqlContext).friends()  # Get example graph
result = g.connectedComponents()
result.select("id", "component").orderBy("component").show()
And I run it with this command (spark-submit options such as --packages have to come before the application script):
spark-submit --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 connected.py
Then it returns this error:
Traceback (most recent call last):
  File "/home/username//test/spark/connected.py", line 11, in <module>
    sc.setCheckpointDir("graphframes_cps")
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 975, in setCheckpointDir
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o19.setCheckpointDir.
How can I fix this problem?
Solution
When running the Connected Components example from graphframes:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends() # Get example graph
result = g.connectedComponents()
result.select("id","component").orderBy("component").show()
I got this error:
java.io.IOException: Checkpoint directory is not set. Please set it first using sc.setCheckpointDir().
This means I hadn't set the checkpoint directory yet, so I added the setCheckpointDir line and ran it again:
sc.setCheckpointDir(dirName="/home/username/graphframes_cps")
result = g.connectedComponents()
result.select("id","component").orderBy("component").show()
The error I got then was:
Traceback (most recent call last):
  File "/home/username//test/spark/connected.py", line 11, in <module>
    sc.setCheckpointDir("graphframes_cps")
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 975, in setCheckpointDir
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o19.setCheckpointDir.
Py4JJavaError: An error occurred while calling o176.setCheckpointDir.
: java.net.ConnectException: Call From huycomputer/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
There were more error lines below this that I didn't pay attention to at first, but this is the root problem: I hadn't started HDFS, so pyspark couldn't connect to localhost:9000, which is the HDFS NameNode port.
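The reason an unqualified path like "graphframes_cps" ends up on HDFS at all is that Spark resolves it against Hadoop's default filesystem (fs.defaultFS in core-site.xml). One way to check what your context resolves to; a debugging sketch that pokes at PySpark's private _jsc handle, which is an implementation detail rather than public API:

# Print the default filesystem Spark's Hadoop configuration points at.
# `_jsc` is a private attribute, so treat this as a debugging trick only.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))
# e.g. prints "hdfs://localhost:9000" when core-site.xml points at HDFS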
So after I ran start-dfs.sh, it worked as expected. But I still don't know how to use a local folder instead. The HDFS web UI at localhost:9870 shows the checkpoint data under the "/home/username/graphframes_cps" path after I ran the example a few times.
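One likely way to use a local folder, though I haven't verified it in this setup, is to pass an explicit file:// URI so Spark does not resolve the path against fs.defaultFS. Note this only makes sense when Spark runs in local mode; on a real cluster every executor must be able to reach the checkpoint directory, which is why HDFS is the usual choice:

# Force the local filesystem with an explicit file:// scheme so Spark
# skips fs.defaultFS (i.e. does not try to reach HDFS).
sc.setCheckpointDir("file:///home/username/graphframes_cps")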
Here is my complete code. I use a Jupyter notebook, so a SparkContext has already been started and I just use the sc variable to call setCheckpointDir():
# sc and sqlContext are already defined by the pyspark Jupyter kernel
from graphframes.examples import Graphs

g = Graphs(sqlContext).friends()  # Get example graph
sc.setCheckpointDir(dirName="/home/dhuy237/graphframes_cps")
result = g.connectedComponents()
result.select("id", "component").orderBy("component").show()
Output:
+---+------------+
| id| component|
+---+------------+
| b|412316860416|
| c|412316860416|
| e|412316860416|
| f|412316860416|
| d|412316860416|
| a|412316860416|
+---+------------+
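For completeness: if you run this with a plain Python kernel instead of the pyspark one, sc and sqlContext are not predefined. A minimal setup sketch (SQLContext is deprecated in Spark 3, but it is still what the graphframes example helper takes):

from pyspark.sql import SparkSession, SQLContext

# Build a session, then derive the legacy handles the example uses.
spark = SparkSession.builder.appName("connected-components").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)  # deprecated, but Graphs() expects it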