How do I fix the errors I get when running the pyspark shell or launching Spark from a jupyter notebook?
I am trying to run the pyspark shell, but when I execute:
(test3.8python) [test@JupyterHub ~]$ python3 /home/test/spark3.1.1/bin/pyspark
I get the following error:
File "/home/test/spark3.1.1/bin/pyspark",line 20
if [ -z "${SPARK_HOME}" ]; then
^
SyntaxError: invalid Syntax
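For reference, line 20 there looks like shell code to me, so I tried inspecting the launcher directly from Python (path copied from the command above):

# Print the first line of the launcher; if it is a bash shebang, the file is a
# shell script that python3 cannot execute, which would explain the SyntaxError.
with open("/home/test/spark3.1.1/bin/pyspark") as f:
    print(f.readline().strip())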
I have set the following in my ~/.bashrc:
export SPARK_HOME=/home/test/spark3.1.1
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PYSPARK_PYTHON=python3
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
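Since these exports live in ~/.bashrc, which is only sourced by interactive shells, I am not sure a kernel spawned by JupyterHub ever sees them; a quick sanity check from inside a notebook cell would be something like:

import os

# ~/.bashrc is only sourced by interactive shells; a kernel started by
# JupyterHub may not inherit these variables at all, so print what it sees.
for var in ("SPARK_HOME", "PYTHONPATH", "PYSPARK_PYTHON", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))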
If I instead try to run it from a jupyter notebook, like this:
import pyspark
from pyspark.sql import SparkSession

#starting daemons for standalone
!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077

#spark standalone
spark = SparkSession.builder \
    .appName("test") \
    .master("spark://JupyterHub:7077") \
    .config("spark.cores.max", "5") \
    .config("spark.executor.memory", "2g") \
    .config("spark.jars.packages", 'org.elasticsearch:elasticsearch-spark-30_2.12:7.12-SNAPSHOT') \
    .config("spark.executor.cores", "5") \
    .enableHiveSupport() \
    .getOrCreate()
I get the following error:
ModuleNotFoundError: No module named 'pyspark'
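If I understand it correctly, this error just means the two Spark directories from my PYTHONPATH never made it onto the interpreter's import path; a minimal check, plus an in-notebook workaround using the paths from my exports above, would be:

import sys

# Show whether any Spark python directories are on the import path at all.
print([p for p in sys.path if "spark" in p.lower()])

# If they are missing, adding them by hand should make `import pyspark` work
# (paths copied from my PYTHONPATH exports above).
sys.path.insert(0, "/home/test/spark3.1.1/python")
sys.path.insert(0, "/home/test/spark3.1.1/python/lib/py4j-0.10.9-src.zip")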
But I do not understand why, since I already pointed PYTHONPATH at the python files inside my spark folder in my bash profile and made sure the changes took effect. While fiddling around I also tried the findspark library, and now, if I run the same code with these imports added:
import findspark
spark_location = '/home/test/spark3.1.1/'
findspark.init(spark_home=spark_location)
import pyspark
from pyspark.sql import SparkSession

#starting daemons for standalone
!/home/test/spark3.1.1/sbin/start-master.sh
!/home/test/spark3.1.1/sbin/start-worker.sh spark://JupyterHub:7077

#spark standalone
spark = SparkSession.builder \
    .appName("test") \
    .master("spark://JupyterHub:7077") \
    .config("spark.cores.max", "5") \
    .config("spark.executor.memory", "2g") \
    .config("spark.jars.packages", 'org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0-SNAPSHOT') \
    .config("spark.executor.cores", "5") \
    .enableHiveSupport() \
    .getOrCreate()
It looks like it can now find pyspark, which makes zero sense to me given that I had already specified everything in my bash file and SPARK_HOME was already set, but now I get a different error:
starting org.apache.spark.deploy.master.Master, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.master.Master-1-JupyterHub.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/test/spark3.1.1//logs/spark-test-org.apache.spark.deploy.worker.Worker-1-JupyterHub.out
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-7-7d402e7d71bf> in <module>
     10
     11 #spark standalone
---> 12 spark = SparkSession.builder \
     13     .appName("test") \
     14     .master("spark://JupyterHub:7077")\
~/spark3.1.1/python/pyspark/sql/session.py in getOrCreate(self)
    226                     sparkConf.set(key, value)
    227                 # This SparkContext may be an existing one.
--> 228                 sc = SparkContext.getOrCreate(sparkConf)
    229                 # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                 # by all sessions.
~/spark3.1.1/python/pyspark/context.py in getOrCreate(cls, conf)
    382         with SparkContext._lock:
    383             if SparkContext._active_spark_context is None:
--> 384                 SparkContext(conf=conf or SparkConf())
    385             return SparkContext._active_spark_context
    386
~/spark3.1.1/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                     " is not allowed as it is a security risk.")
    143
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master,
~/spark3.1.1/python/pyspark/context.py in _ensure_initialized(cls, instance, conf)
    329         with SparkContext._lock:
    330             if not SparkContext._gateway:
--> 331                 SparkContext._gateway = gateway or launch_gateway(conf)
    332                 SparkContext._jvm = SparkContext._gateway.jvm
    333
~/spark3.1.1/python/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106
    107     if not os.path.isfile(conn_info_file):
--> 108         raise Exception("Java gateway process exited before sending its port number")
    109
    110     with open(conn_info_file, "rb") as info:
Exception: Java gateway process exited before sending its port number
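From what I have read, this exception means Spark never managed to start the JVM, which again points at the kernel's environment; as a sketch, exporting the relevant variables from inside the notebook before building the session should at least rule out inheritance problems (values copied from my ~/.bashrc above):

import os

# Make the variables explicit in the kernel's environment instead of relying
# on ~/.bashrc being sourced (values taken from my ~/.bashrc above).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk"
os.environ["SPARK_HOME"] = "/home/test/spark3.1.1"
os.environ["PYSPARK_PYTHON"] = "python3"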
I have already checked the web UI on the default port 8080 and JupyterHub:7077 is up with everything looking fine, so I did manage to start the master and the worker.
Even running spark in local mode with master("local[*]") I get exactly the same error as above.
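A stripped-down, local-only session like the following is what I would use to isolate whether the JVM launches at all (no standalone cluster, no Hive support, no extra packages):

from pyspark.sql import SparkSession

# Minimal local-mode smoke test: if this also fails with the gateway error,
# the problem is launching the JVM itself, not the standalone cluster setup.
spark = SparkSession.builder \
    .appName("smoke-test") \
    .master("local[*]") \
    .getOrCreate()
print(spark.range(5).count())
spark.stop()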
I am completely lost. Any idea why I cannot run pyspark either from the shell or from a jupyter notebook?
Thanks