How to fix a strange error when Spark reads a dataset
I'm running into a strange problem reading files from S3. This is what I'm doing:
val previousDay = spark.read
.option("header","false")
.schema(schema)
.csv(loadPath)
.cache()
Here is the schema:
StructType(
  Array(
    StructField("location_id", DataTypes.StringType, nullable = true),
    StructField("uuid", DataTypes.StringType, nullable = true),
    StructField("country_code", DataTypes.StringType, nullable = true),
    StructField("shard", DataTypes.StringType, nullable = true),
    StructField("has_activity", DataTypes.StringType, nullable = true)
  )
)
The CSV looks like this:
"location_id","uuid","country_code","shard","has_activity"
"35fb2f0XX","06d0XX","FRA","eu","t"
"9ee98XX","7cd3c7XX","DEU",""
"9d193XX","128abXX","ITA",""
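One thing worth checking before blaming Spark: the last two sample rows carry only four fields (ending in an empty quoted string), while the schema declares five columns, so those rows are malformed with respect to the schema. A minimal sketch (plain Scala, not the Spark job; the naive comma split is an assumption that only holds because this sample has no embedded commas) that counts fields per line:

```scala
// Hypothetical helper, not part of the original job: count the fields in
// each sample CSV line to show that some rows are shorter than the schema.
object CsvFieldCount {
  // Naive split on commas, then strip the surrounding quotes.
  // Good enough for this sample; a real parser must handle quoted commas.
  def fields(line: String): Seq[String] =
    line.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")).toSeq

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "\"location_id\",\"uuid\",\"country_code\",\"shard\",\"has_activity\"",
      "\"35fb2f0XX\",\"06d0XX\",\"FRA\",\"eu\",\"t\"",
      "\"9ee98XX\",\"7cd3c7XX\",\"DEU\",\"\"",
      "\"9d193XX\",\"128abXX\",\"ITA\",\"\""
    )
    // The header and first data row yield 5 fields; the last two yield 4.
    lines.foreach(l => println(s"${fields(l).size} fields: $l"))
  }
}
```

Rows that are shorter than the declared schema, combined with `header` set to `"false"` (which makes Spark ingest the header line as a data row), are the kind of mismatch that can produce shifted or duplicated column values like the ones below.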
But when I run a show on previousDay, this is what I get:
+--------------------+--------------------+------------+--------+-----+
|lid                 |uid                 |country     |activity|shard|
+--------------------+--------------------+------------+--------+-----+
|location_id         |uuid                |country_code|shard   |eu   |
|35fb2f0XX           |6d0XX               |FRA         |eu      |eu   |
|9ee98XX             |7cd3c7XX            |DEU         |eu      |eu   |
|9d193XX             |128abXX             |ITA         |eu      |eu   |
+--------------------+--------------------+------------+--------+-----+
As shown here, the shard value is duplicated across two columns, and the activity values have disappeared entirely.
I have no idea what is going on. Any help would be appreciated.