How to count a column's null values per client in PySpark and write the total only on the first row
My input Spark dataframe is:
Date Client Current
2020-10-26 1 NULL
2020-10-27 1 NULL
2020-10-28 1 NULL
2020-10-29 1 NULL
2020-10-30 1 NULL
2020-10-31 1 NULL
2020-11-01 1 NULL
2020-11-02 1 NULL
2020-11-03 1 NULL
2020-11-04 1 NULL
2020-11-05 1 NULL
2020-11-06 1 NULL
2020-11-07 1 NULL
2020-11-08 1 NULL
2020-11-09 1 NULL
2020-10-26 2 NULL
2020-10-27 2 NULL
2020-10-28 2 NULL
2020-10-29 2 10
2020-10-30 2 23
2020-10-31 2 NULL
2020-11-01 2 NULL
2020-11-02 2 1
2020-11-03 2 NULL
2020-11-04 2 NULL
2020-11-05 2 3
2020-10-27 3 NULL
2020-10-28 3 NULL
2020-10-29 3 10
2020-10-30 3 NULL
2020-10-31 3 NULL
2020-11-01 3 NULL
2020-11-02 3 NULL
2020-11-03 3 32
2020-11-04 3 NULL
2020-11-05 3 3
2020-11-03 4 NULL
2020-11-04 4 NULL
2020-11-05 4 NULL
The dataframe is sorted by client_no and Date. If a client's "Current" column is entirely null, the Full_NULL_Count column should contain that client's total null count on the client's first row. Based on the data above, the desired output is:
Date Client Current Full_NULL_Count
2020-10-26 1 NULL 15 -> All "Current" values are null for client 1, so the first row value equals the total null count for Client 1.
2020-10-27 1 NULL NULL
2020-10-28 1 NULL NULL
2020-10-29 1 NULL NULL
2020-10-30 1 NULL NULL
2020-10-31 1 NULL NULL
2020-11-01 1 NULL NULL
2020-11-02 1 NULL NULL
2020-11-03 1 NULL NULL
2020-11-04 1 NULL NULL
2020-11-05 1 NULL NULL
2020-11-06 1 NULL NULL
2020-11-07 1 NULL NULL
2020-11-08 1 NULL NULL
2020-11-09 1 NULL NULL
2020-10-26 2 NULL NULL -> There are non-null Current values for Client 2, so it is null.
2020-10-27 2 NULL NULL
2020-10-28 2 NULL NULL
2020-10-29 2 10 NULL
2020-10-30 2 23 NULL
2020-10-31 2 NULL NULL
2020-11-01 2 NULL NULL
2020-11-02 2 1 NULL
2020-11-03 2 NULL NULL
2020-11-04 2 NULL NULL
2020-11-05 2 3 NULL
2020-10-27 3 NULL NULL -> There are non-null Current values for Client 3, so it is null.
2020-10-28 3 NULL NULL
2020-10-29 3 10 NULL
2020-10-30 3 NULL NULL
2020-10-31 3 NULL NULL
2020-11-01 3 NULL NULL
2020-11-02 3 NULL NULL
2020-11-03 3 32 NULL
2020-11-04 3 NULL NULL
2020-11-05 3 3 NULL
2020-11-03 4 NULL 3 -> All "Current" values are null for client 4, so the first row value equals the total null count for Client 4.
2020-11-04 4 NULL NULL
2020-11-05 4 NULL NULL
Could you help me solve this?
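For reproducibility, the input above can be built with a minimal sketch like the following (schema assumed: Date as string, Client and Current as integers; only a few representative rows are listed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed schema: Date as string, Client as int, Current as nullable int.
# Only a few representative rows are shown here for brevity.
data = [
    ('2020-10-26', 1, None),
    ('2020-10-27', 1, None),
    ('2020-10-29', 2, 10),
    ('2020-10-30', 2, 23),
    ('2020-11-03', 4, None),
]
df = spark.createDataFrame(data, ['Date', 'Client', 'Current'])
df.show()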
Solution
You can count the null values per client and compare that count with the client's total row count.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('Client')

result = df.withColumn(
    # nulls in 'Current' per client == rows per client => all values are null
    'Full_NULL_count', F.when(
        F.sum(F.col('Current').isNull().cast('int')).over(w) == F.count('*').over(w),
        F.count('*').over(w)
    )
).withColumn(
    'rn', F.row_number().over(w.orderBy('Date'))
).withColumn(
    # keep the count only on the client's first row (ordered by Date)
    'Full_NULL_count', F.when(F.col('rn') == 1, F.col('Full_NULL_count'))
).drop('rn').orderBy('Client', 'Date')
result.show(99)
+----------+------+-------+---------------+
| Date|Client|Current|Full_NULL_count|
+----------+------+-------+---------------+
|2020-10-26| 1| null| 15|
|2020-10-27| 1| null| null|
|2020-10-28| 1| null| null|
|2020-10-29| 1| null| null|
|2020-10-30| 1| null| null|
|2020-10-31| 1| null| null|
|2020-11-01| 1| null| null|
|2020-11-02| 1| null| null|
|2020-11-03| 1| null| null|
|2020-11-04| 1| null| null|
|2020-11-05| 1| null| null|
|2020-11-06| 1| null| null|
|2020-11-07| 1| null| null|
|2020-11-08| 1| null| null|
|2020-11-09| 1| null| null|
|2020-10-26| 2| null| null|
|2020-10-27| 2| null| null|
|2020-10-28| 2| null| null|
|2020-10-29| 2| 10| null|
|2020-10-30| 2| 23| null|
|2020-10-31| 2| null| null|
|2020-11-01| 2| null| null|
|2020-11-02| 2| 1| null|
|2020-11-03| 2| null| null|
|2020-11-04| 2| null| null|
|2020-11-05| 2| 3| null|
|2020-10-27| 3| null| null|
|2020-10-28| 3| null| null|
|2020-10-29| 3| 10| null|
|2020-10-30| 3| null| null|
|2020-10-31| 3| null| null|
|2020-11-01| 3| null| null|
|2020-11-02| 3| null| null|
|2020-11-03| 3| 32| null|
|2020-11-04| 3| null| null|
|2020-11-05| 3| 3| null|
|2020-11-03| 4| null| 3|
|2020-11-04| 4| null| null|
|2020-11-05| 4| null| null|
+----------+------+-------+---------------+
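As a side note, the all-null check can also be written with F.count on the column itself, since counting a column ignores nulls; a minimal sketch reusing the same window and row-number trick:
# F.count('Current') ignores nulls, so it is 0 exactly when every
# 'Current' value in the client's partition is null.
result2 = df.withColumn(
    'Full_NULL_count',
    F.when(F.count('Current').over(w) == 0, F.count('*').over(w))
).withColumn(
    'rn', F.row_number().over(w.orderBy('Date'))
).withColumn(
    'Full_NULL_count', F.when(F.col('rn') == 1, F.col('Full_NULL_count'))
).drop('rn').orderBy('Client', 'Date')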
Alternatively, you can achieve this in a single expression: count the null values per client and, if that count matches the number of records for the client, write the count, otherwise null.
from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('Client')

# Per-client null count == per-client row count => all 'Current' values are null.
df = df.withColumn(
    "Full_NULL_Count",
    f.when(f.sum(f.when(f.col("Current").isNotNull(), 0).otherwise(1)).over(w) == f.count('*').over(w),
           f.count('*').over(w)).otherwise(None)
)
df.show()
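Note that this version puts the count on every row of a fully-null client. If, as in the desired output, it should appear only on the client's first row, the row_number trick from the first answer can be layered on top (a sketch, assuming the imports above):
# Keep the count only on the client's first row ordered by Date.
w_ordered = Window.partitionBy('Client').orderBy('Date')
df = df.withColumn(
    'Full_NULL_Count',
    f.when(f.row_number().over(w_ordered) == 1, f.col('Full_NULL_Count'))
)
df.show()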