How to merge two datasets in R based on similar but not exact time variables stored as strings?
I have two datasets to merge, which look like this
time price
0 1103 5
1 0010 10
2 0100 6
3 0201 8
4 0015 7
5 0400 4
6 0500 6
7 0800 3
8 1000 4
9 1140 5
10 1400 2
11 0030 1
12 0112 0
df1
ID | date | time
1 04/06/21 "05:02:06"
2 05/03/21 "04:12:11"
3 02/02/20 "03:02:10"
4 09/09/20 "09:12:14"
5 02/02/21 "15:18:20"
6 04/04/21 "14:00:00"
df2
2ID | date | time
1 04/06/21 "05:12:06"
2 05/03/21 "04:08:11"
3 02/02/20 "03:09:10"
4 09/09/20 "09:12:14"
5 02/02/21 "15:18:20"
6 04/04/21 "15:00:00"
Typically, if I ran a script based on an exact match
df3 <- df2 %>% left_join(df1, by = c("incident_date", "incident_time"))
I would get
ID | date | time | 2ID
1 04/06/21 "05:02:06"
2 05/03/21 "04:12:11"
3 02/02/20 "03:02:10"
4 09/09/20 "09:12:14" 4
5 02/02/21 "15:18:20" 5
6 12/14/22 "14:00:00"
Note that only two of the rows match exactly, but I would also like to match the remaining rows whose times are close. I would like the script to allow a give-or-take of 45 minutes, so that the result ends up looking like this
ID | date | time | 2ID
1 04/06/21 "05:02:06" 1
2 05/03/21 "04:12:11" 2
3 02/02/20 "03:02:10" 3
4 09/09/20 "09:12:14" 4
5 02/02/21 "15:18:20" 5
6 12/14/22 "14:00:00"
I tried to do something like this based on an earlier Stack Overflow question, but I could not get it to work. Does anyone know how to do this?
Source: Merge based on similar but not exact dates
Solution
I think the fuzzyjoin package is best suited for this.
I'll add a $tm (POSIXct) column to both frames, since that is needed to get a clean numeric difference (in seconds).
# %y (not %Y) parses the two-digit years in the data, so "21" becomes 2021 rather than year 0021
df1$tm <- as.POSIXct(paste(df1$date, df1$time), format = "%m/%d/%y %H:%M:%S")
df2$tm <- as.POSIXct(paste(df2$date, df2$time), format = "%m/%d/%y %H:%M:%S")
fuzzyjoin::difference_left_join(df1, df2, by = "tm", max_dist = 45*60)
# ID.x date.x time.x tm.x ID.y date.y time.y tm.y
# 1 1 04/06/21 05:02:06 2021-04-06 05:02:06 1 04/06/21 05:12:06 2021-04-06 05:12:06
# 2 2 05/03/21 04:12:11 2021-05-03 04:12:11 2 05/03/21 04:08:11 2021-05-03 04:08:11
# 3 3 02/02/20 03:02:10 2020-02-02 03:02:10 3 02/02/20 03:09:10 2020-02-02 03:09:10
# 4 4 09/09/20 09:12:14 2020-09-09 09:12:14 4 09/09/20 09:12:14 2020-09-09 09:12:14
# 5 5 02/02/21 15:18:20 2021-02-02 15:18:20 5 02/02/21 15:18:20 2021-02-02 15:18:20
# 6 6 04/04/21 14:00:00 2021-04-04 14:00:00 NA <NA> <NA> <NA>
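To make the idea behind a difference join concrete, here is a minimal base-R sketch. It is not how fuzzyjoin is implemented, only the same idea spelled out by hand: compute every pairwise gap in seconds and keep the pairs within 45 minutes. The data frames are re-created inline from the tables in the question; the helper name `to_time` and the use of `tz = "UTC"` are my own assumptions for illustration. Converting the timestamps with `as.numeric()` makes the unit unambiguously seconds.

```r
# Illustrative copies of the question's data (only ID, date, time).
df1 <- data.frame(
  ID   = 1:6,
  date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"),
  time = c("05:02:06", "04:12:11", "03:02:10", "09:12:14", "15:18:20", "14:00:00")
)
df2 <- data.frame(
  ID   = 1:6,
  date = df1$date,
  time = c("05:12:06", "04:08:11", "03:09:10", "09:12:14", "15:18:20", "15:00:00")
)

# Hypothetical helper: %y parses two-digit years; a fixed time zone (assumed
# UTC here) keeps the result independent of the local daylight-saving rules.
to_time <- function(d) {
  as.POSIXct(paste(d$date, d$time), format = "%m/%d/%y %H:%M:%S", tz = "UTC")
}
tm1 <- to_time(df1)
tm2 <- to_time(df2)

# Matrix of absolute gaps in seconds: rows index df1, columns index df2.
gap <- abs(outer(as.numeric(tm1), as.numeric(tm2), "-"))

# Every (row, col) pair that falls within the 45-minute tolerance.
hits  <- which(gap <= 45 * 60, arr.ind = TRUE)
pairs <- data.frame(ID = df1$ID[hits[, "row"]], ID2 = df2$ID[hits[, "col"]])

# Left-join semantics: keep every df1 row, with NA where nothing matched.
result <- merge(df1, pairs, by = "ID", all.x = TRUE)
print(result)
```

The pairwise matrix grows with the product of the two row counts, so this is only practical for small frames; it is meant to show the logic, not to replace the package.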
Clearly a lot of name cleanup is needed, so how about this:
library(dplyr)  # for %>% and select()
fuzzyjoin::difference_left_join(df1, df2[,c("ID","tm")], by = "tm", max_dist = 45*60) %>%
select(ID = ID.x, date, time, ID2 = ID.y)
# ID date time ID2
# 1 1 04/06/21 05:02:06 1
# 2 2 05/03/21 04:12:11 2
# 3 3 02/02/20 03:02:10 3
# 4 4 09/09/20 09:12:14 4
# 5 5 02/02/21 15:18:20 5
# 6 6 04/04/21 14:00:00 NA
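Note: multiple matches can be found (if several events fall within 45 minutes of each other), so you may want to add a grouped filter that keeps only the closest match:
... %>%
group_by(ID.x) %>%
slice_min(abs(as.numeric(difftime(tm.x, tm.y, units = "secs"))), n = 1, with_ties = FALSE)
(This needs to be done before I rename and drop the tm.* fields.)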
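As a rough illustration of the "keep only the closest match" step, here is a small base-R sketch. The data, the column names, and the helper `nearest_within` are all hypothetical and exist only for this example: event A has two candidates inside the 45-minute window and should keep the nearer one, while event B has none and should stay NA.

```r
# Hypothetical data: A has two candidates within 45 minutes, B has none.
left <- data.frame(
  ID = c("A", "B"),
  tm = as.POSIXct(c("2021-04-06 05:00:00", "2021-04-06 12:00:00"), tz = "UTC")
)
right <- data.frame(
  ID2 = 1:3,
  tm  = as.POSIXct(c("2021-04-06 05:30:00", "2021-04-06 05:10:00",
                     "2021-04-06 20:00:00"), tz = "UTC")
)

max_dist <- 45 * 60  # tolerance in seconds

# Hypothetical helper: index of the closest candidate, or NA when even the
# closest one is farther away than max_dist.
nearest_within <- function(t, candidates, max_dist) {
  gap <- abs(as.numeric(candidates) - as.numeric(t))
  i <- which.min(gap)
  if (length(i) == 0 || gap[i] > max_dist) NA_integer_ else i
}

idx <- vapply(
  seq_len(nrow(left)),
  function(k) nearest_within(left$tm[k], right$tm, max_dist),
  integer(1)
)

# Indexing with NA yields NA, which keeps unmatched rows in the result.
left$ID2 <- right$ID2[idx]
print(left)
```

Looping over the left-hand rows like this avoids producing duplicate rows in the first place, at the cost of scanning all candidates for each row.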
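Data
df1 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:02:06", "04:12:11", "03:02:10", "09:12:14", "15:18:20", "14:00:00")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = 1:6, date = c("04/06/21", "05/03/21", "02/02/20", "09/09/20", "02/02/21", "04/04/21"), time = c("05:12:06", "04:08:11", "03:09:10", "09:12:14", "15:18:20", "15:00:00")), class = "data.frame", row.names = c(NA, -6L))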