
Fuzzy match strings in a single column and record possible matches

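I have a relatively large dataset of about 5k rows containing titles of journal articles and research papers. Here is a small sample of the dataset: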

dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts", "Therapeutic justice: Life inside drug court", "Therapeutic justice: Life inside drug court", "Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comprehensive approach for integrated care", "An ecosystem for improving the quality of personal health records", "Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders", "A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders", "A model for the assessment of static and dynamic factors in sexual offenders", "The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse, and depression", "Co-occurring disorders among mentally ill jail detainees. Implications for public policy", "Comorbidity and Continuity of Psychiatric disorders in Youth After Detention: A Prospective Longitudinal Study", "Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure", "Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure", "Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0", "Diagnosis of active and latent tuberculosis: summary of NICE guidance", "Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium"
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))

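You can see that some titles are duplicated, but with differences in formatting or capitalization. I want to identify the duplicated titles and create a new variable recording which rows are possible matches. To do this, I tried using the agrep function, as suggested here: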

dt$is.match <- sapply(dt$Title, agrep, dt$Title)

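This identifies the matches, but stores the results as a list in the new column. Is there a way to do this (preferably with base R or data.table) so that the agrep results are not stored as a list, but simply identify which rows match (e.g., 6:7)?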

Thanks in advance. I hope I have provided enough information.

Solution

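This is neither base R nor data.table, but here is one way to detect duplicates using the tidyverse together with janitor (get_dupes() comes from janitor):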

library(janitor)
library(tidyverse)

dt %>% 
  mutate(row = row_number()) %>% 
  get_dupes(Title)

Output:

# A tibble: 2 x 3
  Title                                       dupe_count   row
  <chr>                                            <int> <int>
1 Therapeutic justice: Life inside drug court          2     2
2 Therapeutic justice: Life inside drug court          2     3

If you want to pick up duplicates regardless of case, try the following:

dt %>% 
  mutate(Title = str_to_lower(Title), row = row_number()) %>% 
  get_dupes(Title)

Output:

# A tibble: 6 x 3
  Title                                                                      dupe_count   row
  <chr>                                                                           <int> <int>
1 a model for the assessment of static and dynamic factors in sexual offend…          2     7
2 a model for the assessment of static and dynamic factors in sexual offend…          2     8
3 behavioral health and adult milestones in young adults with perinatal hiv…          2    12
4 behavioral health and adult milestones in young adults with perinatal hiv…          2    13
5 therapeutic justice: life inside drug court                                         2     2
6 therapeutic justice: life inside drug court                                         2     3
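
Since the question asked for base R, the same case-insensitive check can be sketched without any packages. This is only a minimal illustration on a small stand-in vector (titles, key, and match_str are names made up for this example, not part of the original answer), and it finds exact duplicates after lowercasing, not fuzzy matches:

```r
# Stand-in vector; the titles are copied from the sample in the question
titles <- c(
  "Therapeutic justice: Life inside drug court",
  "A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
  "A model for the assessment of static and dynamic factors in sexual offenders",
  "Therapeutic justice: Life inside drug court",
  "An ecosystem for improving the quality of personal health records"
)

# Normalise case and surrounding whitespace before comparing
key <- tolower(trimws(titles))

# duplicated() only flags the second and later copies, so scan in both
# directions to flag every member of a duplicate group
is_dupe <- duplicated(key) | duplicated(key, fromLast = TRUE)
dupe_rows <- which(is_dupe)

# For every row, record all row numbers that share its normalised title
groups <- split(seq_along(key), key)
match_str <- vapply(groups[key], toString, character(1), USE.NAMES = FALSE)

print(dupe_rows)
print(match_str)
```

On the real data this would amount to using dt$Title in place of titles and storing match_str as a new column.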

Is something like this what you need?

dt$is.match <- sapply(dt$Title, function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)

dt
# A tibble: 16 x 2
#   Title                                                                                                    is.match
#   <chr>                                                                                                    <chr>   
# 1 Community reinforcement approach in the treatment of opiate addicts                                      1       
# 2 Therapeutic justice: Life inside drug court                                                              2, 3    
# 3 Therapeutic justice: Life inside drug court                                                              2, 3    
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4       
# 5 An ecosystem for improving the quality of personal health records                                        5       
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders                      6       
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders                             7, 8    
# 8 A model for the assessment of static and dynamic factors in sexual offenders                             7, 8    
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9       
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy                 10      
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11      
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure          12, 13  
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure          12, 13  
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14      
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance                                    15      
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium     16     
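
In the output above every row lists itself, so unique titles still get a value. If only the other matching rows are of interest, a variation along these lines might help. It is a rough sketch, not a tested solution for the full 5k-row data: find_matches and max_distance are hypothetical names introduced here, and ignore.case and max.distance are standard agrep arguments. Keep in mind that agrep looks for the pattern approximately inside each string, so a short title can match within a longer one.

```r
# Hypothetical helper: for each title, list the OTHER rows that agrep
# considers approximate matches, or NA when there are none
find_matches <- function(titles, max_distance = 0.1) {
  vapply(seq_along(titles), function(i) {
    hits <- agrep(titles[i], titles,
                  max.distance = max_distance, ignore.case = TRUE)
    hits <- setdiff(hits, i)  # drop the row's match with itself
    if (length(hits) == 0L) NA_character_ else toString(hits)
  }, character(1))
}

# Stand-in vector; the titles are copied from the sample in the question
titles <- c(
  "Therapeutic justice: Life inside drug court",
  "A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
  "A model for the assessment of static and dynamic factors in sexual offenders",
  "An ecosystem for improving the quality of personal health records"
)

res <- find_matches(titles)
print(res)
```

On the real data this would be called as dt$is.match <- find_matches(dt$Title). Each call to agrep scans the whole column, so the work grows with the square of the number of rows; that should be tolerable for about 5k titles but is worth timing first.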
