How to group URL visits by unique users over weekly date windows with a Spark DataFrame
I'm using Spark DataFrames with Scala, with data like the following:
User Id | Date | Url
-------------------------------------------
1 |2020-08-30 | https://example.com/2
2 |2020-08-15 | https://example.com/1
1 |2020-08-01 | https://example.com/3
3 |2020-08-18 | https://example.com/1
1 |2020-08-02 | https://example.com/1
2 |2020-08-04 | https://example.com/2
1 |2020-08-22 | https://example.com/8
4 |2020-08-08 | https://example.com/8
1 |2020-08-29 | https://example.com/4
2 |2020-08-12 | https://example.com/6
1 |2020-08-01 | https://example.com/3
3 |2020-08-18 | https://example.com/7
1 |2020-08-03 | https://example.com/1
2 |2020-08-04 | https://example.com/2
1 |2020-08-23 | https://example.com/6
4 |2020-08-08 | https://example.com/5
...
(somewhat like an access log)
I want to group it by week and URL, with a count of distinct users, like this:
URL |Week |Unique user count
------------------------------------------------------------------------
https://example.com/1 |today to today-7 days | 5
https://example.com/1 |today-7 days to today-14 days | 3
https://example.com/2 |today to today-7 days | 1
https://example.com/2 |today-7 days to today-14 days | 4
https://example.com/3 |today to today-7 days | 6
https://example.com/3 |today-7 days to today-14 days | 4
https://example.com/4 |today to today-7 days | 2
https://example.com/4 |today-7 days to today-14 days | 3
https://example.com/5 |today to today-7 days | 12
https://example.com/5 |today-7 days to today-14 days | 8
https://example.com/6 |today to today-7 days | 6
https://example.com/6 |today-7 days to today-14 days | 4
https://example.com/7 |today to today-7 days | 5
https://example.com/7 |today-7 days to today-14 days | 3
https://example.com/8 |today to today-7 days | 1
https://example.com/8 |today-7 days to today-14 days | 4
I'm new to Spark and DataFrames, but I assume I want to use partitionBy and window functions. If a user visits more than once in a single day, I don't want to count them twice.
So far I have tried:
val dfWithNewColumn = startingDf.withColumn("timeframe_indicator",
  when(col("Date") >= sevenDaysAgoDate, "timeframe_1")
    .when(col("Date") < sevenDaysAgoDate
      && col("Date") >= fourteenDaysAgoDate, "timeframe_2")
    .otherwise("outside_timeframes"))

val dupesOut = dfWithNewColumn.dropDuplicates("Date", "User Id", "Url")
val grouped = dupesOut.groupBy("Url", "timeframe_indicator").count().withColumnRenamed("count", "Unique User Count")
It seems to be working, but I'd like to see how window functions could be used here instead, if possible.
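For reference, here is the bucketing-and-counting logic above sketched in plain Scala collections, without Spark, so the intended result is easy to check by hand. The fixed `today`, the `timeframe` helper, and the sample rows are assumptions made up for this illustration; grouping on the (url, timeframe) pair and counting distinct user ids avoids double-counting repeat visits entirely, whether they happen on the same day or on different days in the week:

```scala
import java.time.LocalDate

// Hypothetical fixed "today" so the example is reproducible.
val today = LocalDate.parse("2020-08-30")

case class Visit(userId: Int, date: LocalDate, url: String)

// Same bucket names as the timeframe_indicator column in the question.
def timeframe(d: LocalDate): String =
  if (!d.isBefore(today.minusDays(7))) "timeframe_1"        // today-7 .. today
  else if (!d.isBefore(today.minusDays(14))) "timeframe_2"  // today-14 .. today-7
  else "outside_timeframes"

// A few rows in the shape of the sample data (made up for the sketch).
val visits = Seq(
  Visit(1, LocalDate.parse("2020-08-30"), "https://example.com/2"),
  Visit(1, LocalDate.parse("2020-08-29"), "https://example.com/4"),
  Visit(1, LocalDate.parse("2020-08-23"), "https://example.com/6"),
  Visit(1, LocalDate.parse("2020-08-22"), "https://example.com/8"),
  Visit(2, LocalDate.parse("2020-08-18"), "https://example.com/1"),
  Visit(3, LocalDate.parse("2020-08-18"), "https://example.com/1"),
  Visit(3, LocalDate.parse("2020-08-18"), "https://example.com/1"), // same-day repeat
  Visit(2, LocalDate.parse("2020-08-04"), "https://example.com/2")  // outside both windows
)

// Distinct users per (url, timeframe): counting distinct userIds per
// group ignores repeat visits, so no explicit dedup step is needed.
val counts: Map[(String, String), Int] =
  visits
    .filter(v => timeframe(v.date) != "outside_timeframes")
    .groupBy(v => (v.url, timeframe(v.date)))
    .map { case (key, vs) => key -> vs.map(_.userId).distinct.size }

// counts(("https://example.com/1", "timeframe_2")) == 2
// (users 2 and 3; user 3's same-day repeat is not double-counted)
```

In Spark terms, this corresponds to replacing the `dropDuplicates` + `count` pair with `groupBy("Url", "timeframe_indicator").agg(countDistinct("User Id"))`.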