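How to get the next distinct records from a table using PySpark
I have a table with the following data
I am trying to get the next set of distinct values (the next 3 values), as shown below
I tried using the lead function, but ended up with the following result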
spark.sql("""select *,
coalesce(lead(page,1) over (partition by id order by date_time asc),'Exit') as next_pagename1,
coalesce(lead(page,2) over (partition by id order by date_time asc),'Exit') as next_pagename2,
coalesce(lead(page,3) over (partition by id order by date_time asc),'Exit') as next_pagename3
from temp""").show()
Can someone tell me what I am missing here?
Edit:
Updated the sample data
Solution
You can group the data by id and page and take the minimum date_time of each group. After that, you can use the SQL you already have:
spark.sql("""with data as (
select id,page,min(date_time) as date_time from
temp group by id,page)
select data.id,data.page,
coalesce(lead(data.page,1) over (partition by data.id
order by data.date_time asc),'Exit') as next_pagename1,
coalesce(lead(data.page,2) over (partition by data.id
order by data.date_time asc),'Exit') as next_pagename2,
coalesce(lead(data.page,3) over (partition by data.id
order by data.date_time asc),'Exit') as next_pagename3
from data""").show()
Output:
+---+-----+--------------+--------------+--------------+
| id| page|next_pagename1|next_pagename2|next_pagename3|
+---+-----+--------------+--------------+--------------+
|123|login| page1| page2| page5|
|123|page1| page2| page5| page3|
|123|page2| page5| page3| Exit|
|123|page5| page3| Exit| Exit|
|123|page3| Exit| Exit| Exit|
+---+-----+--------------+--------------+--------------+
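To check what the grouped query computes on a small sample, the same logic can be sketched in plain Python without Spark. This is only an illustration: the function name `next_pages_first_visit` and the sample rows are hypothetical (the original input table is not reproduced here), and integer timestamps stand in for date_time.

```python
from collections import defaultdict

def next_pages_first_visit(rows, n=3, default="Exit"):
    """rows: iterable of (id, page, date_time) tuples.

    Mirrors the query: keep the earliest date_time per (id, page),
    order each id's pages by that time, then take the next n pages.
    """
    # Equivalent of: group by id, page with min(date_time)
    first_seen = {}
    for id_, page, ts in rows:
        key = (id_, page)
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts

    # Equivalent of: partition by id order by date_time asc
    by_id = defaultdict(list)
    for (id_, page), ts in first_seen.items():
        by_id[id_].append((ts, page))

    result = []
    for id_ in sorted(by_id):
        pages = [page for _, page in sorted(by_id[id_])]
        for i, page in enumerate(pages):
            following = pages[i + 1 : i + 1 + n]          # lead(page, 1..n)
            following += [default] * (n - len(following))  # coalesce(..., 'Exit')
            result.append((id_, page, following))
    return result

# Hypothetical sample rows, chosen only for illustration.
rows = [
    (123, "login", 1),
    (123, "page1", 2),
    (123, "page1", 3),
    (123, "page2", 4),
    (123, "page2", 5),
    (123, "page5", 6),
    (123, "page3", 7),
    (123, "page2", 8),
]

for row in next_pages_first_visit(rows):
    print(row)
```

Note that this approach keeps only the first visit to each page, so a page that is revisited later (page2 at time 8 in the sample) never appears again as a "next" page.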
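With the new data, you can collect the next pages into an array (called data in my code) and then filter that array (filtered_data) to drop consecutive duplicates.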
spark.sql("""
with data as (
select page,array(lead(page,1) over (partition by id order by date_time asc),
lead(page,2) over (partition by id order by date_time asc),
lead(page,3) over (partition by id order by date_time asc),
lead(page,4) over (partition by id order by date_time asc),
lead(page,5) over (partition by id order by date_time asc)) as next
from temp),filtered_data as (
select page,filter(transform(next,(x,i) -> if(i=0 or x!=next[i-1],x,null)),x -> x=x) as next
from data)
select page,ifnull(next[0],'Exit') as next_pagename1,
ifnull(next[1],'Exit') as next_pagename2,
ifnull(next[2],'Exit') as next_pagename3
from filtered_data
""").show(truncate=False)
Output:
+-----+--------------+--------------+--------------+
|page |next_pagename1|next_pagename2|next_pagename3|
+-----+--------------+--------------+--------------+
|login|page1 |page2 |page5 |
|page1|page2 |page5 |page3 |
|page2|page2 |page5 |page3 |
|page2|page5 |page3 |page2 |
|page5|page3 |page2 |Exit |
|page3|page2 |Exit |Exit |
|page2|page2 |Exit |Exit |
|page2|Exit |Exit |Exit |
+-----+--------------+--------------+--------------+
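The transform/filter step can be easier to follow as a plain-Python sketch. The function `next_distinct` below is a hypothetical helper, not part of Spark, and the page sequence is an assumption read off the page column of the output table above (a single id, already ordered by date_time).

```python
def next_distinct(pages, n=3, lookahead=5, default="Exit"):
    """For each position, look at the next `lookahead` pages, drop every
    entry equal to the entry directly before it in that window, and
    return the first n survivors padded with `default`."""
    out = []
    for pos, page in enumerate(pages):
        # Equivalent of array(lead(page,1), ..., lead(page,lookahead));
        # positions past the end are simply absent instead of null.
        window = pages[pos + 1 : pos + 1 + lookahead]
        # Equivalent of transform(...) followed by filter(...):
        # keep x when it is first in the window or differs from its predecessor.
        kept = [x for i, x in enumerate(window) if i == 0 or x != window[i - 1]]
        kept = kept[:n]
        kept += [default] * (n - len(kept))  # ifnull(next[k], 'Exit')
        out.append((page, kept))
    return out

# Assumed page sequence for one id, ordered by date_time.
pages = ["login", "page1", "page2", "page2", "page5", "page3", "page2", "page2"]

for row in next_distinct(pages):
    print(row)
```

As in the SQL, the first element of the window is compared only with later window entries, not with the current page, which is why the first page2 row still lists page2 as its next page.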
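When collecting the data into the first array, I used a maximum lead offset of 5. If necessary, you can increase this number by adding more lead elements to the array.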
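Rather than adding lead elements by hand, the query text can be generated for any lookahead. The sketch below is one possible way to do that; `build_next_pages_query` and its parameters are hypothetical names, and it assumes the same table and column names (temp, id, page, date_time) as the query above.

```python
def build_next_pages_query(lookahead=5, n_next=3, table="temp"):
    """Return the SQL text with `lookahead` lead() calls collected into the
    array and `n_next` output columns (next_pagename1, next_pagename2, ...)."""
    window = "over (partition by id order by date_time asc)"
    leads = ", ".join(
        f"lead(page, {k}) {window}" for k in range(1, lookahead + 1)
    )
    cols = ", ".join(
        f"ifnull(next[{i}], 'Exit') as next_pagename{i + 1}" for i in range(n_next)
    )
    return f"""
with data as (
  select page, array({leads}) as next
  from {table}),
filtered_data as (
  select page,
         filter(transform(next, (x, i) -> if(i = 0 or x != next[i - 1], x, null)),
                x -> x = x) as next
  from data)
select page, {cols}
from filtered_data
"""

# Usage, assuming an existing SparkSession named `spark` and a registered view `temp`:
# spark.sql(build_next_pages_query(lookahead=8)).show(truncate=False)
```

The lookahead should be at least as large as the longest run of repeated pages plus the number of next pages wanted, otherwise the filtered array can come up short and 'Exit' is reported too early.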