
Unable to save an RDD to the local filesystem on Windows 10

How to fix being unable to save an RDD to the local filesystem on Windows 10

I have a Scala/Spark program that validates the XML files in an input directory and then writes a report to a path given as another input parameter (a local filesystem path where the report should be written).

At the stakeholders' request, the program will run on local machines, so I am using Spark in local mode. So far so good, and I have been saving my report to a file with the code below:

dataframe.repartition(1)
    .write
    .option("header","true")
    .mode("overwrite")
    .csv(reportPath)

However, this requires winutils to be installed/configured on the machine that runs my program.
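For context, the winutils setup that the DataFrame writer currently relies on typically looks like the one-line sketch below. Hadoop's shell utilities look for %HADOOP_HOME%\bin\winutils.exe, or for the hadoop.home.dir system property; the C:\hadoop location is my own assumption, not something from the original post.

// Hypothetical setup sketch, assuming winutils.exe lives under C:\hadoop\bin
System.setProperty("hadoop.home.dir", "C:\\hadoop")  // or set the HADOOP_HOME environment variable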

Since we take Cloudera updates frequently, and we bump the jars in our pom file to the latest versions after each one, having to change winutils after every update is an overhead. I have therefore been asked to remove the dependency on winutils.

A quick Google search led me to How to save Spark RDD to local filesystem, so I decided to change the code above to:

val outputRdd = dataframe.rdd
val count = outputRdd.count()
println("\nCount is: " + count + "\n")
println("\nOutput path is: " + reportPath + "\n")
outputRdd.coalesce(1).saveAsTextFile(reportPath)

However, when I run the code I now get this error:

Count is: 15


Output path is: C:\\codingdir\\test\\report

Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.mapred.JobContextImpl.<init>(Lorg/apache/hadoop/mapred/JobConf;Lorg/apache/hadoop/mapreduce/JobID;)V from class org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil
    at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.createJobContext(SparkHadoopWriter.scala:178)
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:67)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
    at com.optus.dcoe.hawk.XmlParser$.delayedEndpoint$com$optus$dcoe$hawk$XmlParser$1(XmlParser.scala:120)
    at com.optus.dcoe.hawk.XmlParser$delayedInit$body.apply(XmlParser.scala:16)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.optus.dcoe.hawk.XmlParser$.main(XmlParser.scala:16)
    at com.optus.dcoe.hawk.XmlParser.main(XmlParser.scala)

I tried changing the value of the reportPath variable to C:\codingdir\test\report, to file://C:/codingdir/test/report, and to other values suggested in related links, but I still get the same error.
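For reference, a minimal sketch (my own illustration, not from the original post) of how a Windows path is usually turned into a file: URI that Spark and Hadoop accept for local output:

// Illustrative only: build a file: URI from a Windows path using plain Java APIs
val localUri = new java.io.File("C:\\codingdir\\test\\report").toURI.toString
println(localUri)  // prints something like file:/C:/codingdir/test/report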

I also found some articles about java.lang.IllegalAccessError, but I do not know how to resolve the error.

Can someone help me resolve this issue?

The HADOOP_HOME environment variable related to winutils has been removed, and the winutils entry has been removed from the PATH variable. I am using Java 8 on Windows 10 (all users of this program are on similar laptops). The Spark version is 2.4.0-cdh6.2.1.

Solution

I finally found the problem: it was caused by some unneeded MapReduce-related dependencies. They have now been removed, and I have moved on to a different error.
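A quick way to track down this kind of conflict (a diagnostic sketch of my own, not part of the original answer) is to print which jar each of the two classes named in the IllegalAccessError was loaded from; mismatched Hadoop/MapReduce artifacts usually show up as jars with different versions:

// Diagnostic sketch: print the jar that provides each class involved in the error
for (name <- Seq("org.apache.hadoop.mapred.JobContextImpl",
                 "org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil")) {
  val src = Class.forName(name).getProtectionDomain.getCodeSource
  println(s"$name -> ${if (src != null) src.getLocation else "unknown"}")
}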
