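How to merge two data frames based on the maximum number of common words in R
I have two data.frames, one containing partial names and the other containing full names, as follows: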
partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU"))
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Fundation (WWF)", "(ILU)", "Ilusion"))
Ideally, I would like to end up with a table like this (my real partial df has 12,794 rows):
print(partial)
partial      full
Apple        Apple Inc
Apple        Apple Inc
WWF          World Wide Fundation (WWF)
wizz air     wizzair
WeMove.eu    We Move Europe
... 12 794 total rows
Every row that has no match should get NA.
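I have tried many things: fuzzyjoin with regex, regex_left_join, and even the sqldf package. I get some results, but I think it would work better if regex_left_join understood that I am looking for whole words. I know that boundary(type = c("word")) exists in stringr, but I don't know how to use it here.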
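So far, I have only prepared the partial df by replacing non-alphanumeric characters with spaces and converting the names to lowercase.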
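library(stringr)
# Replace runs of non-word characters with a space, collapse whitespace, then lowercase
partial$regex <- str_squish(str_replace_all(partial$partial.name, regex("\\W+"), " "))
partial$regex <- tolower(partial$regex)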
How can I match partial$partial.name with full$full.name based on the maximum number of common words?
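Solution
Partial string matching takes time to get right. I believe the Jaro-Winkler distance is a good candidate, but you will need to spend time tuning the parameters, namely the prefix weight p (with p = 0 you get the plain Jaro distance) and the distance threshold above which a row is treated as unmatched. Here is an example to get you started.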
library(stringdist)

partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", "None"), stringsAsFactors = FALSE)
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = FALSE)

# Return the element of list_of_fulls closest to `partial`, or NA if none is close enough
mydist <- function(partial, list_of_fulls, method = "jw", p = 0, threshold = 0.4) {
  # stringdist() is vectorised over b, so this yields one distance per full name;
  # method must be passed explicitly, otherwise stringdist() falls back to "osa"
  distances <- stringdist(a = partial, b = list_of_fulls, method = method, p = p)
  # If the distance is too great assume NA
  if (min(distances) > threshold) {
    NA_character_
  } else {
    closest_index <- which.min(distances)
    list_of_fulls[closest_index]
  }
}

partial$match <- vapply(partial$partial.name, mydist, character(1), list_of_fulls = full$full.name, method = "jw", USE.NAMES = FALSE)
partial
#   partial.name  match
#1  Apple         Apple Inc
#2  Apple         Apple Inc
#3  WWF           World Wide Foundation (WWF)
#4  wizz air      wizzair
#5  WeMove.eu     We Move Europe
#6  ILU           (ILU)
#7  None          <NA>
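If you want the literal criterion from the question (the full name sharing the largest number of whole words), the following is a minimal base R sketch rather than a definitive implementation. The helper names tokenize and match_by_words are hypothetical, invented for this illustration, and treating any run of non-alphanumeric characters as a word separator is an assumption.

```r
# Hypothetical helper: split a name into lowercase alphanumeric words
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]  # drop the empty string produced by leading punctuation
}

# Hypothetical helper: return the full name that shares the most words with
# partial_name, or NA when no word is shared at all
match_by_words <- function(partial_name, full_names) {
  partial_words <- tokenize(partial_name)
  shared <- vapply(
    full_names,
    function(f) length(intersect(partial_words, tokenize(f))),
    integer(1),
    USE.NAMES = FALSE
  )
  if (max(shared) == 0L) NA_character_ else full_names[which.max(shared)]
}

partial_names <- c("Apple", "WWF", "wizz air", "ILU", "None")
full_names <- c("Apple Inc", "wizzair", "We Move Europe",
                "World Wide Foundation (WWF)", "(ILU)", "Ilusion")

result <- vapply(partial_names, match_by_words, character(1),
                 full_names = full_names, USE.NAMES = FALSE)
print(result)
```

Note the trade-off: exact word overlap cannot pair "wizz air" with "wizzair" because they share no whole word, so that row comes back as NA here, whereas the distance-based approach above does match it. Ties are resolved in favour of the first full name with the highest count.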