Why doesn't the Spark package resolver `--packages` copy dependencies to $SPARK_HOME/jars? (Used with Jupyter)
Can someone explain to me why, even though I use the automatic package resolver `--packages`, I still have to manually copy `com.amazonaws_aws-java-sdk-bundle` into my local $SPARK_HOME?
What I do is a spark submit by starting the spark-shell:
$SPARK_HOME/bin/spark-shell \
--master k8s://https://localhost:6443 \
--deploy-mode client \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=spark:spark-docker \
--packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
--conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.s3a.access.key=$MINIO_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$MINIO_SECRET_KEY \
--conf spark.hadoop.fs.s3a.endpoint=$MINIO_ENDPOINT \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.driver.port=4040 \
--name spark-locally
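To see where the resolver actually puts these jars, a quick check (my own addition, using the paths from the Ivy log below) is to compare the Ivy cache with the Spark install:

```bash
# --packages resolves into the Ivy cache, not into the Spark install:
ls $HOME/.ivy2/jars | grep -i aws     # the bundle jar shows up here
ls $SPARK_HOME/jars | grep -i aws     # ...but nothing is copied here
```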
My setup is the latest Spark 3.0.1 prebuilt for Hadoop 3.2, running on local Kubernetes with Docker Desktop for Mac.
The `--packages org.apache.hadoop:hadoop-aws:3.2.0` above successfully downloads the dependencies, including com.amazonaws#aws-java-sdk-bundle;1.11.375 as a transitive dependency:
Ivy Default Cache set to: /Users/sspaeti/.ivy2/cache
The jars for the packages stored in: /Users/sspaeti/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sspaeti/Documents/spark/spark-3.0.1-bin-hadoop3.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hadoop#hadoop-aws added as a dependency
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-91fd31e1-0b2a-448c-9c69-fd9dc430d41c;1.0
confs: [default]
found org.apache.hadoop#hadoop-aws;3.2.0 in central
found com.amazonaws#aws-java-sdk-bundle;1.11.375 in central
found io.delta#delta-core_2.12;0.7.0 in central
found org.antlr#antlr4;4.7 in central
found org.antlr#antlr4-runtime;4.7 in central
found org.antlr#antlr-runtime;3.5.2 in central
found org.antlr#ST4;4.0.8 in central
found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
found org.glassfish#javax.json;1.0.4 in central
found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 376ms :: artifacts dl 22ms
:: modules in use:
com.amazonaws#aws-java-sdk-bundle;1.11.375 from central in [default]
com.ibm.icu#icu4j;58.2 from central in [default]
io.delta#delta-core_2.12;0.7.0 from central in [default]
org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
org.antlr#ST4;4.0.8 from central in [default]
org.antlr#antlr-runtime;3.5.2 from central in [default]
org.antlr#antlr4;4.7 from central in [default]
org.antlr#antlr4-runtime;4.7 from central in [default]
org.apache.hadoop#hadoop-aws;3.2.0 from central in [default]
org.glassfish#javax.json;1.0.4 from central in [default]
But why do I then always get the error `java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException`? I don't understand it: since I run in `deploy-mode client`, I assumed Maven/Ivy would resolve all dependencies for my local Spark driver, wouldn't it? Or where is the missing piece of the puzzle?
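What makes it stranger is that the class is in fact inside the bundle jar that Ivy downloaded. A quick probe (my own addition, using the .ivy2 path shown further below):

```bash
# The "missing" class is present in the already-downloaded jar:
unzip -l $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar \
    | grep MultiObjectDeleteException
```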
I also tried listing the bundle explicitly with `--packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0,com.amazonaws:aws-java-sdk-bundle:1.11.375`, with no luck either.
My solution (but I don't know why it works)
What works is copying the jar manually, either downloaded from Maven or taken straight from the .ivy2 folder it was already resolved into, like this:
cp $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar $SPARK_HOME/jars
After that, I can read from and write to my local S3 (MinIO) successfully.
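For what it's worth, an equivalent workaround that leaves $SPARK_HOME untouched (my own sketch, reusing the jar Ivy already downloaded) would be to pass the bundle explicitly via --jars:

```bash
# Same invocation as above, with the bundle added via --jars instead of
# being copied into $SPARK_HOME/jars (remaining --conf options as above):
$SPARK_HOME/bin/spark-shell \
    --jars $HOME/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.11.375.jar \
    --packages org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0 \
    --name spark-locally
```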
Used with Jupyter
The other strange thing is that I also have a Jupyter Notebook running on local Kubernetes, and there everything works with plain `--packages`, no manual copying needed. There I use pyspark, so is the difference that it works with pyspark but not with spark-shell? If so, how would I run the same test with pyspark instead of spark-shell? (See the sketch below.)
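To frame what I mean by "the same test", here is a minimal pyspark sketch. The MinIO endpoint, credentials, and bucket name are placeholders of my own; substitute your $MINIO_* values:

```python
# Minimal PySpark equivalent of the spark-shell test above (a sketch,
# with placeholder endpoint/credentials/bucket -- adjust to your setup).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-smoke-test")
    # spark.jars.packages is the config equivalent of --packages
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.2.0,io.delta:delta-core_2.12:0.7.0")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

# Any S3A write forces the AWS SDK classes to load, so this either
# reproduces the NoClassDefFoundError or proves the classpath is fine:
spark.range(10).write.mode("overwrite").parquet("s3a://test-bucket/smoke-test")
print(spark.read.parquet("s3a://test-bucket/smoke-test").count())
```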
Many thanks for any explanations; I have already wasted a lot of time on this.