微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

如何在 sparklyr 中随时间计算唯一值给出的示例?

如何解决如何在 sparklyr 中随时间计算唯一值给出的示例?

我正在尝试根据时间戳(四舍五入到分钟)计算过去 10 分钟内看到的唯一设备。我可以在 data.table 中做到这一点,但不知道如何在 R 中的 sparklyr 中复制相同的内容。540 指的是 # 秒添加到当前时间戳。

下面提供了一个例子来解释我的问题。

给定数据

df<-data.frame(device_subscriber_id=c("x","a","z","x","y","z"),start_timestamp=c("2020-12-11 14:21:00","2020-12-11 14:22:00","2020-12-11 14:23:00","2020-12-11 14:26:00","2020-12-11 14:24:00","2020-12-11 14:25:00","2020-12-11 14:28:00","2020-12-11 14:31:00","2020-12-11 14:38:00"))

df$start_timestamp<-as.POSIXct(df$start_timestamp,format="%Y-%m-%d %H:%M:%s")
dt<-setDT(df)

预期数据

expected_dt<-dt[dt[,.(start_timestamp3=start_timestamp,start_timestamp2 = start_timestamp - 540,device_subscriber_id)],on = .(start_timestamp >= start_timestamp2,start_timestamp<=start_timestamp3),allow.cartesian = TRUE][,.(unique_devices_seen = uniqueN(device_subscriber_id)),by = .(start_timestamp = start_timestamp + 540)]

expected_dt

   start_timestamp unique_devices_seen
   2020-12-11 14:21:00                   1
   2020-12-11 14:22:00                   2
   2020-12-11 14:23:00                   3
   2020-12-11 14:26:00                   3
   2020-12-11 14:24:00                   3
   2020-12-11 14:25:00                   3
   2020-12-11 14:28:00                   4
   2020-12-11 14:31:00                   4
   2020-12-11 14:38:00                   2

解决方法

我建议在当前行和前 540 秒之间使用 SQL 窗口函数 OVERcount(distinct device_subscriber_id)Error: org.apache.spark.sql.AnalysisException: Distinct window functions are not supported。解决方法是收集一组唯一 ID 并返回数组的大小。为了使用以秒为单位的范围值,时间戳被转换为纪元。

library(sparklyr)
library(tidyverse)
sc <- spark_connect(master="local[4]",version = "3.0.1")

sdf <- copy_to(sc,df,name = "df",overwrite = TRUE)

sdf_sql(sc,"
SELECT 
  start_timestamp,size(collect_set(device_subscriber_id) 
       OVER (ORDER BY start_ts_epoch ASC 
             RANGE BETWEEN 540 PRECEDING AND CURRENT ROW)) as unique_devices_seen
FROM (SELECT *,unix_timestamp(start_timestamp) as start_ts_epoch FROM `df`)")

结果:

# Source: spark<?> [?? x 2]
   start_timestamp     unique_devices_seen
   <dttm>                            <int>
 1 2020-12-11 13:21:00                   1
 2 2020-12-11 13:22:00                   2
 3 2020-12-11 13:23:00                   3
 4 2020-12-11 13:24:00                   3
 5 2020-12-11 13:25:00                   3
 6 2020-12-11 13:26:00                   3
 7 2020-12-11 13:26:00                   3
 8 2020-12-11 13:28:00                   4
 9 2020-12-11 13:31:00                   4
10 2020-12-11 13:38:00                   2

参考Spark SQL Window Functions API


奖励:如果需要缺少时间戳,您需要将设备数据与包含所有可能时间戳的表连接起来。缺少的时间戳会将设备 ID 设为空值,并且不会影响计数。

df_ts <- data.frame(start_timestamp=seq(min(df$start_timestamp),max(df$start_timestamp),by = "min"))
sdf_ts <- copy_to(sc,df_ts,name = "df_ts","
SELECT DISTINCT
  start_timestamp,size(collect_set(device_subscriber_id) 
         OVER (ORDER BY start_ts_epoch ASC 
               RANGE BETWEEN 540 PRECEDING AND CURRENT ROW)) as unique_devices_seen,concat_ws(',',collect_set(device_subscriber_id)
                   OVER (ORDER BY start_ts_epoch ASC 
                   RANGE BETWEEN 540 PRECEDING AND CURRENT ROW)) as unique_devices_seen_csv
FROM (SELECT 
        device_subscriber_id,df_ts.start_timestamp,unix_timestamp(df_ts.start_timestamp) as start_ts_epoch
      FROM df
      FULL JOIN df_ts ON (df.start_timestamp = df_ts.start_timestamp))") %>% print(n=30)

请注意,我添加了 unique_devices_seen_csv 以显示幕后发生的事情。它连接滑动窗口的设备 ID。

结果:

# Source: spark<?> [?? x 3]
   start_timestamp     unique_devices_seen unique_devices_seen_csv
   <dttm>                            <int> <chr>                  
 1 2020-12-11 13:21:00                   1 x                      
 2 2020-12-11 13:22:00                   2 x,a                    
 3 2020-12-11 13:23:00                   3 z,x,a                  
 4 2020-12-11 13:24:00                   3 z,a                  
 5 2020-12-11 13:25:00                   3 z,a                  
 6 2020-12-11 13:26:00                   3 z,a                  
 7 2020-12-11 13:27:00                   3 z,a                  
 8 2020-12-11 13:28:00                   4 z,y,a                
 9 2020-12-11 13:29:00                   4 z,a                
10 2020-12-11 13:30:00                   4 z,a                
11 2020-12-11 13:31:00                   4 z,a                
12 2020-12-11 13:32:00                   4 z,a                
13 2020-12-11 13:33:00                   4 z,a                
14 2020-12-11 13:34:00                   4 z,a                
15 2020-12-11 13:35:00                   3 y,a                  
16 2020-12-11 13:36:00                   2 y,a                    
17 2020-12-11 13:37:00                   2 y,a                    
18 2020-12-11 13:38:00                   2 z,a
,

如果使用 SQL 计数,我们可以使用 来查询 Spark 集群:

library(data.table)
library(sparklyr)

sc <- spark_connect(master = "local")
copy_to(sc,dt)

sdf_sql(sc,"
SELECT COUNT(DISTINCT dt1.device_subscriber_id) as unique_devices_seen,dt2.start_timestamp

FROM dt dt1
INNER JOIN dt dt2 ON dt1.start_timestamp >= dt2.start_timestamp - INTERVAL 9 minutes
                   AND dt1.start_timestamp <= dt2.start_timestamp

GROUP BY dt2.start_timestamp

ORDER BY start_timestamp
           ")


## # Source: spark<?> [?? x 2]
##   unique_devices_seen start_timestamp    
##                 <dbl> <dttm>             
## 1                   1 2020-12-11 19:21:00
## 2                   2 2020-12-11 19:22:00
## 3                   3 2020-12-11 19:23:00
## 4                   3 2020-12-11 19:24:00
## 5                   3 2020-12-11 19:25:00
## 6                   3 2020-12-11 19:26:00
## 7                   4 2020-12-11 19:28:00
## 8                   4 2020-12-11 19:31:00
## 9                   2 2020-12-11 19:38:00

SQL 似乎是一个很好的中间立场 - 非常适合转换为 SQL。

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。

相关推荐


Selenium Web驱动程序和Java。元素在(x,y)点处不可单击。其他元素将获得点击?
Python-如何使用点“。” 访问字典成员?
Java 字符串是不可变的。到底是什么意思?
Java中的“ final”关键字如何工作?(我仍然可以修改对象。)
“loop:”在Java代码中。这是什么,为什么要编译?
java.lang.ClassNotFoundException:sun.jdbc.odbc.JdbcOdbcDriver发生异常。为什么?
这是用Java进行XML解析的最佳库。
Java的PriorityQueue的内置迭代器不会以任何特定顺序遍历数据结构。为什么?
如何在Java中聆听按键时移动图像。
Java“Program to an interface”。这是什么意思?