微信公众号搜"智元新知"关注
微信扫一扫可直接关注哦!

使用 dplyr/dtplyr 对 data.table 进行复杂更新的最佳方法是什么

如何解决使用 dplyr/dtplyr 对 data.table 进行复杂更新的最佳方法是什么

我们编写了一个包来分析与时间窗口相关的大量事件。

为了进行分析,我们需要建立一些窗口和交叉引用的属性 他们之间。

这是在其原生语法中使用 data.table 完成的。一些步骤的示例包含在下面的 reprex 中。

我们现在正在寻求使用 dplyr/dtplyr 重新构建此包,以提高可读性并与其他人共享 派对。

虽然我可以用 dplyr 语法编写“查询”,但我没有看到将更新应用到底层表的 tidyverse 方法 - 添加列、更新行等,而无需重复创建和替换副本。 当数据量很大时,data.table 的“就地更新”特性是非常可取的。有没有办法在 dplyr 语法中利用这一点? (我遇到了 immutable = FALSE 的障碍并尝试使用 rows_update())

library(data.table)
set.seed <- 123
#Create a table of events with timestamp and an event type (501 events randomly generated over the prevIoUs 30 days)
DT1 <- data.table(timeStamp = as.POSIXct('2021-03-25') - as.integer(runif(501)*60*1440*30),eventType=c('A','B','C'))
setkey(DT1,timeStamp)
print(DT1)
#>                timeStamp eventType
#>   1: 2021-02-23 00:42:37         A
#>   2: 2021-02-23 04:21:43         A
#>   3: 2021-02-23 05:23:51         C
#>   4: 2021-02-23 06:45:36         C
#>   5: 2021-02-23 08:34:32         B
#>  ---                              
#> 497: 2021-03-24 11:32:09         A
#> 498: 2021-03-24 13:49:53         B
#> 499: 2021-03-24 14:26:55         C
#> 500: 2021-03-24 18:11:33         C
#> 501: 2021-03-24 20:13:51         A
#Create a table of time windows.  One for each date represented with an early and late time for each
#Assign this a class (in this example the value of the most common eventType)
DT2 <- DT1[,keyby=.(date=lubridate::date(timeStamp)),.(earlyTime = min(timeStamp - 1),lateTime = max(timeStamp + 1),as = sum(eventType == 'A'),bs = sum(eventType == 'B'),cs = sum(eventType == 'C'))][,.(date,earlyTime,lateTime,class=ifelse(as >= bs & as >= cs,'A',ifelse(bs >= cs,'C')))]
print(head(DT2))
#>          date           earlyTime            lateTime class
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13     B
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14     B
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44     A
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51     B
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46     A
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44     A
#Give each row in DT2 a row number (so that we can readily cross-reference between rows)
DT2[order(lateTime),rn := .I]

#For each row,get the row number of the prevIoUs instance of this class
DT2[order(class,rn),prevOfClass := shift(rn,1),by=.(class)]
print(head(DT2))
#>          date           earlyTime            lateTime class rn prevOfClass
#> 1: 2021-02-23 2021-02-23 00:42:36 2021-02-23 23:14:13     B  1          NA
#> 2: 2021-02-24 2021-02-24 04:10:27 2021-02-24 21:28:14     B  2           1
#> 3: 2021-02-25 2021-02-25 03:38:29 2021-02-25 21:55:44     A  3          NA
#> 4: 2021-02-26 2021-02-26 01:49:00 2021-02-26 23:40:51     B  4           2
#> 5: 2021-02-27 2021-02-27 00:18:40 2021-02-27 22:42:46     A  5           3
#> 6: 2021-02-28 2021-02-28 02:50:25 2021-02-28 22:44:44     A  6           5
#For each row that is not a 'C' find the prevIoUs and next instances of a C type row
#Note that when we assigned rn we ensured that the rows were in ascending time order
#so rn can be used as a proxy for sorting by time
DT2[class=='C'][DT2[class != 'C'],on=.(rn > rn),by=.EACHI,.(rn=i.rn,nextC = min(x.rn),prevC = min(x.prevOfClass))]
#>     rn rn nextC prevC
#>  1:  1  1     8    NA
#>  2:  2  2     8    NA
#>  3:  3  3     8    NA
#>  4:  4  4     8    NA
#>  5:  5  5     8    NA
#>  6:  6  6     8    NA
#>  7:  7  7     8    NA
#>  8:  9  9    13     8
#>  9: 10 10    13     8
#> 10: 11 11    13     8
#> 11: 12 12    13     8
#> 12: 14 14    16    13
#> 13: 15 15    16    13
#> 14: 17 17    26    16
#> 15: 18 18    26    16
#> 16: 19 19    26    16
#> 17: 20 20    26    16
#> 18: 21 21    26    16
#> 19: 22 22    26    16
#> 20: 23 23    26    16
#> 21: 24 24    26    16
#> 22: 25 25    26    16
#> 23: 28 28    30    27
#> 24: 29 29    30    27
#>     rn rn nextC prevC

#But I want to add this information as additional columns to the base table
DT2[DT2[class=='C'][DT2[class != 'C'],prevC = min(x.prevOfClass))],on = .(rn),':='(nextC=i.nextC,prevC = i.prevC)
]
print(DT2[,.(rn,date,class,prevOfClass,nextC,prevC)])
#>     rn       date class prevOfClass nextC prevC
#>  1:  1 2021-02-23     B          NA     8    NA
#>  2:  2 2021-02-24     B           1     8    NA
#>  3:  3 2021-02-25     A          NA     8    NA
#>  4:  4 2021-02-26     B           2     8    NA
#>  5:  5 2021-02-27     A           3     8    NA
#>  6:  6 2021-02-28     A           5     8    NA
#>  7:  7 2021-03-01     A           6     8    NA
#>  8:  8 2021-03-02     C          NA    NA    NA
#>  9:  9 2021-03-03     A           7    13     8
#> 10: 10 2021-03-04     A           9    13     8
#> 11: 11 2021-03-05     B           4    13     8
#> 12: 12 2021-03-06     A          10    13     8
#> 13: 13 2021-03-07     C           8    NA    NA
#> 14: 14 2021-03-08     A          12    16    13
#> 15: 15 2021-03-09     B          11    16    13
#> 16: 16 2021-03-10     C          13    NA    NA
#> 17: 17 2021-03-11     A          14    26    16
#> 18: 18 2021-03-12     B          15    26    16
#> 19: 19 2021-03-13     A          17    26    16
#> 20: 20 2021-03-14     B          18    26    16
#> 21: 21 2021-03-15     A          19    26    16
#> 22: 22 2021-03-16     A          21    26    16
#> 23: 23 2021-03-17     A          22    26    16
#> 24: 24 2021-03-18     A          23    26    16
#> 25: 25 2021-03-19     B          20    26    16
#> 26: 26 2021-03-20     C          16    NA    NA
#> 27: 27 2021-03-21     C          26    NA    NA
#> 28: 28 2021-03-22     B          25    30    27
#> 29: 29 2021-03-23     A          24    30    27
#> 30: 30 2021-03-24     C          27    NA    NA
#>     rn       date class prevOfClass nextC prevC

#What would be the best approach to this using dplyr / dtplyr Syntax?
#In practice there are many hundreds of thousands of rows in the tables
#and...
#There are many more update and enrichments that need to be applied
#some of which add new columns,others will update just a few rows
#in a column
#So 'mutate in place/by reference' is highly desirable

reprex package (v1.0.0) 于 2021 年 3 月 25 日创建

版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。