How to find the average of an array-of-arrays column along axis 0 in a PySpark DataFrame?
I have a PySpark DataFrame:
df = spark.createDataFrame([
    ("u1", [[1., 2., 3.], [1., 2., 0.], [1., 0., 0.]]),
    ("u2", [[1., 10., 0.]]),
    ("u3", [[1., 2., 3.], [10., 0., 0.]]),
], ['user_id', 'features'])
print(df.printSchema())
df.show(truncate=False)
Output:
root
|-- user_id: string (nullable = true)
|-- features: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: double (containsNull = true)
None
+-------+---------------------------------------------------+
|user_id|features                                           |
+-------+---------------------------------------------------+
|u1     |[[1.0, 2.0, 3.0], [1.0, 2.0, 0.0], [1.0, 0.0, 0.0]]|
|u2     |[[1.0, 10.0, 0.0]]                                 |
|u3     |[[1.0, 2.0, 3.0], [10.0, 0.0, 0.0]]                |
+-------+---------------------------------------------------+
I want to compute the mean of these arrays along axis 0 for each user. The desired output looks like this:
+-------+---------------------------------------------------+----------------+
|user_id|features                                           |avg_features    |
+-------+---------------------------------------------------+----------------+
|u1     |[[1.0, 2.0, 3.0], [1.0, 2.0, 0.0], [1.0, 0.0, 0.0]]|[1.0, 1.33, 1.0]|
|u2     |[[1.0, 10.0, 0.0]]                                 |[1.0, 10.0, 0.0]|
|u3     |[[1.0, 2.0, 3.0], [10.0, 0.0, 0.0]]                |[5.5, 1.0, 1.5] |
+-------+---------------------------------------------------+----------------+
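For reference, the per-user computation being asked for can be sketched in plain Python. This is only an illustration of what "mean along axis 0" means for one user's list of arrays, not a Spark solution. The helper name `axis0_mean` is hypothetical, and the sketch assumes all inner arrays have the same length.

```python
def axis0_mean(rows):
    """Hypothetical helper: mean of equal-length lists, position by position."""
    # zip(*rows) regroups the values by position:
    # all first elements together, all second elements together, and so on.
    return [sum(col) / len(col) for col in zip(*rows)]


# Two sample arrays of length 3
print(axis0_mean([[1.0, 2.0, 3.0], [10.0, 0.0, 0.0]]))  # [5.5, 1.0, 1.5]
```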
How can I achieve this?
Solution
Edit: a more scalable solution:
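import pyspark.sql.functions as F

df2 = df.withColumn(
    # one row per inner array
    'exploded_features', F.explode('features')
).select(
    # one row per (pos, col) pair of each inner array
    'user_id', 'features', F.posexplode('exploded_features')
).groupBy(
    # 'features' must be part of the key so it survives the aggregation
    'user_id', 'features', 'pos'
).agg(
    F.mean('col')
).groupBy(
    'user_id', 'features'
).agg(
    # collect [pos, mean] pairs and sort them by position
    F.array_sort(
        F.collect_list(
            F.array('pos', 'avg(col)')
        )
    ).alias('avg_features')
).withColumn(
    # drop the position and keep only the mean
    'avg_features', F.expr('transform(avg_features, x -> x[1])')
)
df2.show(truncate=False)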
+-------+---------------------------------------------------+------------------------------+
|user_id|features                                           |avg_features                  |
+-------+---------------------------------------------------+------------------------------+
|u1     |[[1.0, 2.0, 3.0], [1.0, 2.0, 0.0], [1.0, 0.0, 0.0]]|[1.0, 1.3333333333333333, 1.0]|
|u2     |[[1.0, 10.0, 0.0]]                                 |[1.0, 10.0, 0.0]              |
|u3     |[[1.0, 2.0, 3.0], [10.0, 0.0, 0.0]]                |[5.5, 1.0, 1.5]               |
+-------+---------------------------------------------------+------------------------------+
The higher-order functions transform and aggregate can also be used to operate on the arrays directly.
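The code for the transform and aggregate approach is not preserved here, so the following is only a sketch of how it might look, not the original answer. It assumes Spark 2.4 or later (for the SQL higher-order functions), a non-empty `features` array in every row, and inner arrays that all have the same length as the first one. The names `AVG_FEATURES_EXPR` and `with_avg_features` are hypothetical.

```python
# Spark SQL expression: for each position x in the first inner array,
# sum the x-th element of every inner array, then divide by their count.
# Assumption: all inner arrays of a row are as long as features[0].
AVG_FEATURES_EXPR = """
    transform(
        sequence(0, size(features[0]) - 1),
        x -> aggregate(
            features,
            cast(0 as double),
            (acc, y) -> acc + y[x]
        ) / size(features)
    )
"""


def with_avg_features(df):
    """Hypothetical helper: append the per-position mean as 'avg_features'."""
    return df.selectExpr("*", f"{AVG_FEATURES_EXPR} AS avg_features")
```

Calling `with_avg_features(df).show(truncate=False)` on a DataFrame with a `features` column should add `avg_features` without any explode or groupBy step, since each row is processed on its own.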