
PySpark: count the null values of a column per client

How can I count, per client, the null values of a column in PySpark?

My input Spark DataFrame is:

  Date        Client  Current 
    2020-10-26  1       NULL   
    2020-10-27  1       NULL   
    2020-10-28  1       NULL   
    2020-10-29  1       NULL   
    2020-10-30  1       NULL   
    2020-10-31  1       NULL   
    2020-11-01  1       NULL   
    2020-11-02  1       NULL    
    2020-11-03  1       NULL    
    2020-11-04  1       NULL    
    2020-11-05  1       NULL    
    2020-11-06  1       NULL    
    2020-11-07  1       NULL    
    2020-11-08  1       NULL    
    2020-11-09  1       NULL    
    2020-10-26  2       NULL    
    2020-10-27  2       NULL    
    2020-10-28  2       NULL    
    2020-10-29  2       10      
    2020-10-30  2       23      
    2020-10-31  2       NULL    
    2020-11-01  2       NULL    
    2020-11-02  2       1       
    2020-11-03  2       NULL    
    2020-11-04  2       NULL    
    2020-11-05  2       3       
    2020-10-27  3       NULL    
    2020-10-28  3       NULL    
    2020-10-29  3       10      
    2020-10-30  3       NULL    
    2020-10-31  3       NULL    
    2020-11-01  3       NULL    
    2020-11-02  3       NULL    
    2020-11-03  3       32      
    2020-11-04  3       NULL    
    2020-11-05  3       3       
    2020-11-03  4       NULL    
    2020-11-04  4       NULL    
    2020-11-05  4       NULL  

The DataFrame is sorted by client_no and Date. If a client's Current column is entirely null, the Full_NULL_Count column should hold that client's total null count in its first row. The desired output, based on the data above, is:

   Date        Client  Current Full_NULL_Count
    2020-10-26  1       NULL    15   -> All "Current" values are null for client 1, so the first row
                                        value equals the total null count for Client 1.
    2020-10-27  1       NULL    NULL
    2020-10-28  1       NULL    NULL
    2020-10-29  1       NULL    NULL
    2020-10-30  1       NULL    NULL
    2020-10-31  1       NULL    NULL
    2020-11-01  1       NULL    NULL
    2020-11-02  1       NULL    NULL
    2020-11-03  1       NULL    NULL
    2020-11-04  1       NULL    NULL
    2020-11-05  1       NULL    NULL
    2020-11-06  1       NULL    NULL
    2020-11-07  1       NULL    NULL
    2020-11-08  1       NULL    NULL
    2020-11-09  1       NULL    NULL
    2020-10-26  2       NULL    NULL -> There are non-null Current values for Client 2, so it's null.
    2020-10-27  2       NULL    NULL
    2020-10-28  2       NULL    NULL
    2020-10-29  2       10      NULL
    2020-10-30  2       23      NULL
    2020-10-31  2       NULL    NULL
    2020-11-01  2       NULL    NULL
    2020-11-02  2       1       NULL
    2020-11-03  2       NULL    NULL
    2020-11-04  2       NULL    NULL
    2020-11-05  2       3       NULL
    2020-10-27  3       NULL    NULL -> There are non-null Current values for Client 3, so it's null.
    2020-10-28  3       NULL    NULL
    2020-10-29  3       10      NULL
    2020-10-30  3       NULL    NULL
    2020-10-31  3       NULL    NULL
    2020-11-01  3       NULL    NULL
    2020-11-02  3       NULL    NULL
    2020-11-03  3       32      NULL
    2020-11-04  3       NULL    NULL
    2020-11-05  3       3       NULL
    2020-11-03  4       NULL    3    -> All "Current" values are null for client 4, so the first row
                                        value equals the total null count for Client 4.
    2020-11-04  4       NULL    NULL
    2020-11-05  4       NULL    NULL

Can you help me solve this?

Solution

You can count the null values per client and compare that count with the total number of rows for the same client.
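
If you want to run the snippets below locally, here is a minimal sketch that rebuilds the sample input; it assumes an active SparkSession named spark, and keeps Date as a plain string and Current as a nullable integer.

from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('Date', T.StringType()),
    T.StructField('Client', T.IntegerType()),
    T.StructField('Current', T.IntegerType()),
])

rows = (
    # Client 1: 15 rows, Current entirely null
    [('2020-10-%02d' % d, 1, None) for d in range(26, 32)]
    + [('2020-11-%02d' % d, 1, None) for d in range(1, 10)]
    # Client 2: mixed null and non-null Current values
    + [('2020-10-26', 2, None), ('2020-10-27', 2, None), ('2020-10-28', 2, None),
       ('2020-10-29', 2, 10), ('2020-10-30', 2, 23), ('2020-10-31', 2, None),
       ('2020-11-01', 2, None), ('2020-11-02', 2, 1), ('2020-11-03', 2, None),
       ('2020-11-04', 2, None), ('2020-11-05', 2, 3)]
    # Client 3: mixed null and non-null Current values
    + [('2020-10-27', 3, None), ('2020-10-28', 3, None), ('2020-10-29', 3, 10),
       ('2020-10-30', 3, None), ('2020-10-31', 3, None), ('2020-11-01', 3, None),
       ('2020-11-02', 3, None), ('2020-11-03', 3, 32), ('2020-11-04', 3, None),
       ('2020-11-05', 3, 3)]
    # Client 4: 3 rows, Current entirely null
    + [('2020-11-03', 4, None), ('2020-11-04', 4, None), ('2020-11-05', 4, None)]
)

df = spark.createDataFrame(rows, schema)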

from pyspark.sql import functions as F, Window

w = Window.partitionBy('Client')

result = df.withColumn(
    # if the number of null Current values in the client's window equals the
    # number of rows in that window, the column is entirely null for the client
    'Full_NULL_count', F.when(
        F.sum(F.col('Current').isNull().cast('int')).over(w) == F.count('*').over(w),
        F.count('*').over(w)
    )
).withColumn(
    # number each client's rows by Date
    'rn', F.row_number().over(w.orderBy('Date'))
).withColumn(
    # keep the count only on the client's first row; all other rows become null
    'Full_NULL_count', F.when(F.col('rn') == 1, F.col('Full_NULL_count'))
).drop('rn').orderBy('Client', 'Date')

result.show(99)
+----------+------+-------+---------------+
|      Date|Client|Current|Full_NULL_count|
+----------+------+-------+---------------+
|2020-10-26|     1|   null|             15|
|2020-10-27|     1|   null|           null|
|2020-10-28|     1|   null|           null|
|2020-10-29|     1|   null|           null|
|2020-10-30|     1|   null|           null|
|2020-10-31|     1|   null|           null|
|2020-11-01|     1|   null|           null|
|2020-11-02|     1|   null|           null|
|2020-11-03|     1|   null|           null|
|2020-11-04|     1|   null|           null|
|2020-11-05|     1|   null|           null|
|2020-11-06|     1|   null|           null|
|2020-11-07|     1|   null|           null|
|2020-11-08|     1|   null|           null|
|2020-11-09|     1|   null|           null|
|2020-10-26|     2|   null|           null|
|2020-10-27|     2|   null|           null|
|2020-10-28|     2|   null|           null|
|2020-10-29|     2|     10|           null|
|2020-10-30|     2|     23|           null|
|2020-10-31|     2|   null|           null|
|2020-11-01|     2|   null|           null|
|2020-11-02|     2|      1|           null|
|2020-11-03|     2|   null|           null|
|2020-11-04|     2|   null|           null|
|2020-11-05|     2|      3|           null|
|2020-10-27|     3|   null|           null|
|2020-10-28|     3|   null|           null|
|2020-10-29|     3|     10|           null|
|2020-10-30|     3|   null|           null|
|2020-10-31|     3|   null|           null|
|2020-11-01|     3|   null|           null|
|2020-11-02|     3|   null|           null|
|2020-11-03|     3|     32|           null|
|2020-11-04|     3|   null|           null|
|2020-11-05|     3|      3|           null|
|2020-11-03|     4|   null|              3|
|2020-11-04|     4|   null|           null|
|2020-11-05|     4|   null|           null|
+----------+------+-------+---------------+

You can also do this in a single withColumn: count the null values per client, and if that count matches the number of records for the client, use the count, otherwise null.

from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('Client')

# when the per-client sum of null Current values equals the per-client row count,
# the Current column is entirely null for that client, so use the count; otherwise null
df = df.withColumn("Full_NULL_Count",
                   f.when(f.sum(f.when(f.col("Current").isNotNull(), 0).otherwise(1)).over(w) == f.count('*').over(w),
                          f.count('*').over(w)).otherwise(None))
df.show()
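
Note that this version writes the count on every row of a client whose Current column is entirely null, not only on the first row. To match the desired output exactly, you can still mask the remaining rows with a row number, as in the first answer; a minimal sketch:

df = df.withColumn('rn', f.row_number().over(w.orderBy('Date')))
# keep the count only on each client's first row
df = df.withColumn('Full_NULL_Count', f.when(f.col('rn') == 1, f.col('Full_NULL_Count'))).drop('rn')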
