如何解决将pyspark中的嵌套数据框展平为列
嗨,我有pyspark中提取的JSON数据,示例如下。
{
"data": [
["row-r9pv-p86t.ifsp","00000000-0000-0000-0838-60C2FFCC43AE",1574264158,null,"{ }","2007","ZOEY","KINGS","F","11"],["row-7v2v~88z5-44se","00000000-0000-0000-C8FC-DDD3F9A72DFF","SUFFOLK","6"],["row-hzc9-4kvv~mbc9","00000000-0000-0000-562E-D9A0792557FC","MONROE","6"]
]
}
我正在尝试分解多数组并将每个记录分解为数据帧的单行,但看起来像这样:
df= spark.read.json('data/rows.json',multiLine=True)
temp_df = df.select(explode("data").alias("data"))
temp_df.show(n=3,truncate=False)
结果:
+-----------------------------------------------------------------------------------------------------------------------+
|data |
+-----------------------------------------------------------------------------------------------------------------------+
|[row-r9pv-p86t.ifsp,00000000-0000-0000-0838-60C2FFCC43AE,{ },2007,ZOEY,KINGS,F,11] |
|[row-7v2v~88z5-44se,00000000-0000-0000-C8FC-DDD3F9A72DFF,SUFFOLK,6]|
|[row-hzc9-4kvv~mbc9,00000000-0000-0000-562E-D9A0792557FC,MONROE,6] |
+-----------------------------------------------------------------------------------------------------------------------+
temp_df.printSchema()
temp_df.show(5)
temp_df.select(flatten(temp_df.data)).show(n=10)
到目前为止还不错,但是当我尝试使用flatten
方法将数据帧的每一行中的数组展平时,它给了我错误提示
cannot resolve 'flatten('data')' due to data type mismatch: The argument should be an array of arrays,but 'data' is of array<string> type.
很有道理,但我不确定如何才能展平数组。
我应该编写任何自定义map
方法将行数组映射到数据框列吗?
解决方法
回答我自己的问题。这样可以帮助任何需要的人。
- 从文件读取源数据
df= spark.read.json('data/rows.json',multiLine=True)
temp_df = df.select(explode("data").alias("data"))
temp_df.show(n=3,truncate=False)
结果:
+-----------------------------------------------------------------------------------------------------------------------+
|data |
+-----------------------------------------------------------------------------------------------------------------------+
|[row-r9pv-p86t.ifsp,00000000-0000-0000-0838-60C2FFCC43AE,1574264158,{ },2007,ZOEY,KINGS,F,11] |
|[row-7v2v~88z5-44se,00000000-0000-0000-C8FC-DDD3F9A72DFF,SUFFOLK,6]|
|[row-hzc9-4kvv~mbc9,00000000-0000-0000-562E-D9A0792557FC,MONROE,6] |
+-----------------------------------------------------------------------------------------------------------------------+
在上面的数据框中,每个单元格包含一个字符串数组,而我需要的是每个元素都在单独的列中和特定的数据类型。
df = temp_df.withColumn("sid",temp_df["data"].getItem(0).cast(StringType())) \
.withColumn("id",temp_df["data"].getItem(1).cast(IntegerType())) \
.withColumn("position",temp_df["data"].getItem(2).cast(IntegerType())) \
.withColumn("created_at",temp_df["data"].getItem(3).cast(TimestampType())) \
.withColumn("created_meta",temp_df["data"].getItem(4).cast(StringType())) \
.withColumn("updated_at",temp_df["data"].getItem(5).cast(TimestampType())) \
.withColumn("updated_meta",temp_df["data"].getItem(6).cast(StringType())) \
.withColumn("meta",temp_df["data"].getItem(7).cast(StringType())) \
.withColumn("Year",(temp_df["data"].getItem(8)).cast(IntegerType())) \
.withColumn("First Name",temp_df["data"].getItem(9).cast(StringType())) \
.withColumn("County",temp_df["data"].getItem(10).cast(StringType())) \
.withColumn("Sex",temp_df["data"].getItem(11).cast(StringType())) \
.withColumn("Count",temp_df["data"].getItem(12).cast(IntegerType())) \
.drop("data")
df.show()
df.printSchema()
+------------------+----+--------+----------+------------+----------+------------+----+----+----------+-------+---+-----+
| sid| id|position|created_at|created_meta|updated_at|updated_meta|meta|Year|First Name| County|Sex|Count|
+------------------+----+--------+----------+------------+----------+------------+----+----+----------+-------+---+-----+
|row-r9pv-p86t.ifsp|null| 0| null| null| null| null| { }|2007| ZOEY| KINGS| F| 11|
|row-7v2v~88z5-44se|null| 0| null| null| null| null| { }|2007| ZOEY|SUFFOLK| F| 6|
|row-hzc9-4kvv~mbc9|null| 0| null| null| null| null| { }|2007| ZOEY| MONROE| F| 6|
+------------------+----+--------+----------+------------+----------+------------+----+----+----------+-------+---+-----+
==================== SCHEMA ====================
root
|-- sid: string (nullable = true)
|-- id: integer (nullable = true)
|-- position: integer (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- created_meta: string (nullable = true)
|-- updated_at: timestamp (nullable = true)
|-- updated_meta: string (nullable = true)
|-- meta: string (nullable = true)
|-- Year: integer (nullable = true)
|-- First Name: string (nullable = true)
|-- County: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Count: integer (nullable = true)
,
val resDF = temp_df.select(
'data.getItem(0).alias("c0"),'data.getItem(1).alias("c1"),'data.getItem(2).alias("c2"),'data.getItem(3).alias("c3")
// ...
)
resDF.show(false)
// +------------------+------------------------------------+---+----------+
// |c0 |c1 |c2 |c3 |
// +------------------+------------------------------------+---+----------+
// |row-r9pv-p86t.ifsp|00000000-0000-0000-0838-60C2FFCC43AE|0 |1574264158|
// |row-7v2v~88z5-44se|00000000-0000-0000-C8FC-DDD3F9A72DFF|0 |1574264158|
// |row-hzc9-4kvv~mbc9|00000000-0000-0000-562E-D9A0792557FC|0 |1574264158|
// +------------------+------------------------------------+---+----------+
V 2(使用WithColumn和concat_ws):
val sourceDF = Seq(
Array("row-r9pv-p86t.ifsp","00000000-0000-0000-0838-60C2FFCC43AE","0","1574264158","","{ }","2007","ZOEY","KINGS","F","11"),Array("row-7v2v~88z5-44se","00000000-0000-0000-C8FC-DDD3F9A72DFF","SUFFOLK","6"),Array("row-hzc9-4kvv~mbc9","00000000-0000-0000-562E-D9A0792557FC","MONROE","6")
).toDF("dataColumn")
sourceDF.show(false)
// +-------------------------------------------------------------------------------------------------------------------------+
// |dataColumn |
// +-------------------------------------------------------------------------------------------------------------------------+
// |[row-r9pv-p86t.ifsp,11] |
// |[row-7v2v~88z5-44se,6]|
// |[row-hzc9-4kvv~mbc9,6] |
// +-------------------------------------------------------------------------------------------------------------------------+
val df1 = sourceDF
.withColumn("dataString",concat_ws(",",'dataColumn))
.select('dataString)
df1.printSchema()
df1.show(false)
// root
// |-- dataString: string (nullable = false)
//
// +-----------------------------------------------------------------------------------------------------------------------+
// |dataString |
// +-----------------------------------------------------------------------------------------------------------------------+
// |row-r9pv-p86t.ifsp,11 |
// |row-7v2v~88z5-44se,6|
// |row-hzc9-4kvv~mbc9,6 |
// +-----------------------------------------------------------------------------------------------------------------------+
val df2 = df1.select(
split('dataString,").getItem(0).alias("c0"),split('dataString,").getItem(1).alias("c1"),").getItem(2).alias("c2"),").getItem(3).alias("c3"),").getItem(4).alias("c4"),").getItem(5).alias("c5"),").getItem(6).alias("c6"),").getItem(7).alias("c7"),").getItem(8).alias("c8"),").getItem(9).alias("c9"),").getItem(10).alias("c10"),").getItem(11).alias("c11"),").getItem(12).alias("c12")
)
df2.printSchema()
// root
// |-- c0: string (nullable = true)
// |-- c1: string (nullable = true)
// |-- c2: string (nullable = true)
// |-- c3: string (nullable = true)
// |-- c4: string (nullable = true)
// |-- c5: string (nullable = true)
// |-- c6: string (nullable = true)
// |-- c7: string (nullable = true)
// |-- c8: string (nullable = true)
// |-- c9: string (nullable = true)
// |-- c10: string (nullable = true)
// |-- c11: string (nullable = true)
// |-- c12: string (nullable = true)
df2.show(false)
// +------------------+------------------------------------+---+----------+---+----------+---+---+----+----+-------+---+---+
// |c0 |c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |c10 |c11|c12|
// +------------------+------------------------------------+---+----------+---+----------+---+---+----+----+-------+---+---+
// |row-r9pv-p86t.ifsp|00000000-0000-0000-0838-60C2FFCC43AE|0 |1574264158| |1574264158| |{ }|2007|ZOEY|KINGS |F |11 |
// |row-7v2v~88z5-44se|00000000-0000-0000-C8FC-DDD3F9A72DFF|0 |1574264158| |1574264158| |{ }|2007|ZOEY|SUFFOLK|F |6 |
// |row-hzc9-4kvv~mbc9|00000000-0000-0000-562E-D9A0792557FC|0 |1574264158| |1574264158| |{ }|2007|ZOEY|MONROE |F |6 |
// +------------------+------------------------------------+---+----------+---+----------+---+---+----+----+-------+---+---+
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。