Suppose I have two data.tables:

    library(data.table)
    library(lubridate)  # provides ymd()

    summary <- data.table(
      period    = c("A", "B", "C", "D"),
      from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
      to_date   = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
    )

    # event1/event2 values match the printed table (one value per date)
    log <- data.table(
      date   = ymd(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                     "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
      event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
      event2 = c(1, 8, 7, 3, 8, 4, 6, 3)
    )

They look like this:

    > summary
       period  from_date    to_date
    1:      A 2017-01-01 2017-01-31
    2:      B 2017-01-03 2017-04-01
    3:      C 2017-02-08 2017-03-08
    4:      D 2017-03-07 2017-05-01
    > log
             date event1 event2
    1: 2017-01-03      4      1
    2: 2017-01-20      8      8
    3: 2017-02-01      8      7
    4: 2017-03-03      4      3
    5: 2017-03-15      3      8
    6: 2017-03-28      4      4
    7: 2017-04-03      7      6
    8: 2017-04-23      3      3

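For each period in the summary table, I want the sum of event1 and event2 over the log rows whose date falls within that period.
I know I can do it like this: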

    summary[, c("event1", "event2") := .(
      sum(log[date >= from_date & date <= to_date, event1]),
      sum(log[date >= from_date & date <= to_date, event2])
    ), by = period][]

which gives the desired result:

       period  from_date    to_date event1 event2
    1:      A 2017-01-01 2017-01-31     12      9
    2:      B 2017-01-03 2017-04-01     31     31
    3:      C 2017-02-08 2017-03-08      4      3
    4:      D 2017-03-07 2017-05-01     17     21

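In my real problem, however, I have about 30 columns to sum, and I may want to change that set later. The summary table has ~35,000 rows and the log has ~40,000,000 rows. Is there an efficient way to do this?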
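Solution
Yes, you can perform a non-equi join.
(Note that I have renamed log and summary to Log and Summary, because the original names are already functions in R.)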

    # After the join, `date` holds from_date and `date.1` holds to_date
    Log[Summary,
        on = c("date>=from_date", "date<=to_date"),
        nomatch = 0L, allow.cartesian = TRUE
    ][, .(from_date = date[1],
          to_date   = date.1[1],
          event1    = sum(event1),
          event2    = sum(event2)),
      keyby = "period"]

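To make it easier to see which log rows the non-equi join pairs with each period, here is a minimal sketch that keeps the real event date next to the range bounds by using data.table's `x.` and `i.` column prefixes in `j`. The names `pairs` and `totals` are illustrative only, and `as.Date()` is used instead of `ymd()` purely to avoid the lubridate dependency.

```r
library(data.table)

Summary <- data.table(
  period    = c("A", "B", "C", "D"),
  from_date = as.Date(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
  to_date   = as.Date(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)
Log <- data.table(
  date   = as.Date(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                     "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
  event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
  event2 = c(1, 8, 7, 3, 8, 4, 6, 3)
)

# One row per (period, log row) pair where from_date <= date <= to_date.
# i.* refers to columns of Summary, x.* to columns of Log.
pairs <- Log[Summary,
             .(period     = i.period,
               from_date  = i.from_date,
               to_date    = i.to_date,
               event_date = x.date,
               event1, event2),
             on = .(date >= from_date, date <= to_date),
             nomatch = 0L, allow.cartesian = TRUE]

# Collapse the matched pairs to one row per period.
totals <- pairs[, .(event1 = sum(event1), event2 = sum(event2)),
                keyby = .(period, from_date, to_date)]
```

Printing `pairs` before aggregating is a quick way to check that the range conditions select the rows you expect.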
To sum every column whose name matches a pattern, use lapply with .SD:
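
    # The join needs the same `on` conditions as above
    joined_result <- Log[Summary,
                         on = c("date>=from_date", "date<=to_date"),
                         nomatch = 0L, allow.cartesian = TRUE]

    # Select the columns to sum by name pattern
    cols <- grep("event[a-z]?[0-9]", names(joined_result), value = TRUE)

    joined_result[, lapply(.SD, sum), .SDcols = cols,
                  keyby = .(period, from_date = date, to_date = date.1)]
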
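Because the set of columns to sum may change later, one option is to wrap the pattern-based approach in a small function. The sketch below is only an illustration: the helper name `sum_events_by_period`, its `pattern` argument, and the default regular expression are assumptions, not part of the original answer. It also re-attaches the range bounds with an ordinary join on `period` rather than relying on the `date` / `date.1` columns.

```r
library(data.table)

# Hypothetical helper: sums every column of `log_dt` whose name matches
# `pattern`, over the rows falling inside each range of `summary_dt`.
sum_events_by_period <- function(log_dt, summary_dt, pattern = "^event[0-9]+$") {
  cols <- grep(pattern, names(log_dt), value = TRUE)
  stopifnot(length(cols) > 0L)

  # Non-equi join: keep log rows with from_date <= date <= to_date.
  joined <- log_dt[summary_dt,
                   on = c("date>=from_date", "date<=to_date"),
                   nomatch = 0L, allow.cartesian = TRUE]

  # Sum the selected columns within each period.
  sums <- joined[, lapply(.SD, sum), .SDcols = cols, keyby = period]

  # Bring back the original from_date / to_date for each period.
  # Periods with no matching log rows are dropped (nomatch = 0L above).
  summary_dt[sums, on = "period"]
}

Summary <- data.table(
  period    = c("A", "B", "C", "D"),
  from_date = as.Date(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
  to_date   = as.Date(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)
Log <- data.table(
  date   = as.Date(c("2017-01-03", "2017-01-20", "2017-02-01", "2017-03-03",
                     "2017-03-15", "2017-03-28", "2017-04-03", "2017-04-23")),
  event1 = c(4, 8, 8, 4, 3, 4, 7, 3),
  event2 = c(1, 8, 7, 3, 8, 4, 6, 3)
)

res <- sum_events_by_period(Log, Summary)
```

Changing which columns are summed then only requires a different `pattern`, for example `"^event1$"` to sum a single column.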