How to find multiple modes of an array column in PySpark
I want to find the mode of the task column in this dataframe:
+-----+---------------------------+
| id  | task                      |
+-----+---------------------------+
| 101 | [person1,person1,person3] |
| 102 | [person1,person2,person3] |
| 103 | null                      |
| 104 | [person1,person2]         |
| 105 | [person1,person2]         |
| 106 | null                      |
+-----+---------------------------+
If there are multiple modes, I would like to display all of them.
Can someone help me get this output:
+-----+---------------------------+---------------------------+
| id  | task                      | mode                      |
+-----+---------------------------+---------------------------+
| 101 | [person1,person1,person3] | [person1]                 |
| 102 | [person1,person2,person3] | [person1,person2,person3] |
| 103 | null                      | []                        |
| 104 | [person1,person2]         | [person1,person2]         |
| 105 | [person1,person2]         | [person1,person2]         |
| 106 | null                      | []                        |
+-----+---------------------------+---------------------------+
This is my first question. Any help or hints would be greatly appreciated. Thank you.
Solution
For Spark 2.4+, I see no reason to use a UDF, since you can get the desired output with higher order functions. A UDF based on Counter is very slow on big data compared to higher order functions.
Using Spark 2.3:
You can solve this with a custom UDF. To get multiple mode values, I used Counter. To handle the null case in your task column, I used a try/except block inside the UDF.
(Python 3.8+ users can use the built-in statistics.multimode() function instead.)
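To illustrate the statistics.multimode() alternative just mentioned: it returns all modes directly, though wrapping it as a UDF would still need a null guard (multi_mode_or_empty below is a hypothetical helper name, not part of the original answer):

```python
from statistics import multimode  # Python 3.8+

print(multimode(["person1", "person1", "person3"]))  # ['person1']
print(multimode(["person1", "person2"]))             # ['person1', 'person2']

# Null-safe wrapper in the spirit of the UDF below (hypothetical helper):
def multi_mode_or_empty(arr):
    return multimode(arr) if arr else []

print(multi_mode_or_empty(None))  # []
```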
Your dataframe:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import *
schema = StructType([StructField("id", IntegerType()), StructField("task", ArrayType(StringType()))])
data = [[101, ["person1", "person1", "person3"]],
        [102, ["person1", "person2", "person3"]],
        [103, None],
        [104, ["person1", "person2"]],
        [105, ["person1", "person2"]],
        [106, None]]
df = spark.createDataFrame(data, schema=schema)
The operation:
from collections import Counter

def get_multi_mode_list(input_array):
    multi_mode = []
    counter_var = Counter(input_array)  # Counter(None) is simply an empty Counter
    try:
        # frequency of the most common element
        temp = counter_var.most_common(1)[0][1]
    except IndexError:
        # null/empty task: most_common(1) is [], and the loop below never runs
        temp = counter_var.most_common(1)
    for i in counter_var:
        if input_array.count(i) == temp:
            multi_mode.append(i)
    return list(set(multi_mode))

get_multi_mode_list_udf = F.udf(get_multi_mode_list, ArrayType(StringType()))
df.withColumn("multi_mode", get_multi_mode_list_udf(col("task"))).show(truncate=False)
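The helper can be sanity-checked in plain Python before registering it as a UDF (the function is restated here so the snippet runs standalone):

```python
from collections import Counter

def get_multi_mode_list(input_array):
    multi_mode = []
    counter_var = Counter(input_array)  # Counter(None) is simply an empty Counter
    try:
        temp = counter_var.most_common(1)[0][1]
    except IndexError:
        temp = counter_var.most_common(1)
    for i in counter_var:
        if input_array.count(i) == temp:
            multi_mode.append(i)
    return list(set(multi_mode))

print(get_multi_mode_list(["person1", "person1", "person3"]))  # ['person1']
print(get_multi_mode_list(None))                               # []
```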
Output:
+---+-------------------------+-------------------------+
|id |task                     |multi_mode               |
+---+-------------------------+-------------------------+
|101|[person1,person1,person3]|[person1]                |
|102|[person1,person2,person3]|[person2,person3,person1]|
|103|null                     |[]                       |
|104|[person1,person2]        |[person2,person1]        |
|105|[person1,person2]        |[person2,person1]        |
|106|null                     |[]                       |
+---+-------------------------+-------------------------+
Another answer:
from collections import Counter
from pyspark.sql.types import ArrayType, StringType

df = spark.createDataFrame([
    (101, ["person1", "person1", "person3"]),
    (102, ["person1", "person2", "person3"]),
    (103, None),
    (104, ["person1", "person2"]),
    (105, ["person1", "person2"]),
    (106, None)], ["id", "list"])
def mode(list1):
    res = []
    if list1 is None or len(list1) == 0:
        return []
    test_list1 = Counter(list1)
    # frequency of the most common element
    temp = test_list1.most_common(1)[0][1]
    for ele in list1:
        if list1.count(ele) == temp:
            res.append(ele)
    return list(set(res))
df.createOrReplaceTempView("A")
spark.udf.register("mode", mode, ArrayType(StringType()))
spark.sql("select id, list, mode(list) as func from A").show(truncate=False)