Find city names in affiliations and add them, with their corresponding countries, to new columns of a data frame

I have a data frame "dfa" of affiliations that contain city names. The country name is sometimes missing, as in row 4 (Baghdad) and row 7 (Berlin):

dfa <- data.frame(affiliation = c(
  "DEPARTMENT OF PHARMACY,AMSTERDAM UNIVERSITY,AMSTERDAM,THE NETHERLANDS",
  "DEPARTMENT OF BIOCHEMISTRY,LADY HARDINGE MEDICAL COLLEGE,NEW DELHI,INDIA.",
  "DEPARTMENT OF PATHOLOGY,CHILDREN'S HOSPITAL,LOS ANGELES,UNITED STATES",
  "COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD.",
  "DEPARTMENT OF CLINICAL LABORATORY,BEIJING GENERAL HOSPITAL,BEIJING,CHINA.",
  "LABORATORY OF MOLECULAR BIOLOGY,ISTITUTO ORTOPEDICO,MILAN,ITALY.",
  "DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN",
  "INSTITUTE OF LABORATORY MEDICINE,UNIVERSITY HOSPITAL,MUNICH,GERMANY.",
  "DEPARTMENT OF CLINICAL PATHOLOGY,MAHIDOL UNIVERSITY,BANGKOK,THAILAND.",
  "DEPARTMENT OF BIOLOGY,WASEDA UNIVERSITY,TOKYO,JAPAN",
  "DEPARTMENT OF MOLECULAR BIOLOGY,MINISTRY OF HEALTH,TEHRAN,IRAN.",
  "LABORATORY OF CARDIOVASCULAR DISEASE,FUWAI HOSPITAL,CHINA."
))

I also have a second data frame, "dfb", listing cities and their countries. Some of these cities appear in "dfa":

dfb <- data.frame(
  city = c("AGRI", "AMSTERDAM", "ATHENS", "AUCKLAND", "BUENOS AIRES", "BEIJING", "BAGHDAD", "BANGKOK", "BERLIN", "BUDAPEST"),
  country = c("TURKEY", "NETHERLANDS", "GREECE", "NEW ZEALAND", "ARGENTINA", "CHINA", "IRAQ", "THAILAND", "GERMANY", "HUNGARY")
)

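How can I add the city and its corresponding country in two new columns, only for cities that appear in both "dfa" and "dfb" (even when the country is missing from the affiliation, as with Baghdad and Berlin)?

Note: the goal is to match complete city names, not fragments of them. Row 7 below is an example of what I do not want: the city AGRI (TURKEY) is wrongly attached to the Berlin affiliation because that row contains the word "AGRICULTURE".

Is there a simple way to do this, preferably with dplyr?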

   affiliation                                                                city       country
 1 DEPARTMENT OF PHARMACY,AMSTERDAM UNIVERSITY,AMSTERDAM,THE NETHERLANDS      AMSTERDAM  NETHERLANDS
 2 DEPARTMENT OF BIOCHEMISTRY,LADY HARDINGE MEDICAL COLLEGE,NEW DELHI,INDIA.  <NA>       <NA>
 3 DEPARTMENT OF PATHOLOGY,CHILDREN'S HOSPITAL,LOS ANGELES,UNITED STATES      <NA>       <NA>
 4 COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD.               BAGHDAD    IRAQ
 5 DEPARTMENT OF CLINICAL LABORATORY,BEIJING GENERAL HOSPITAL,BEIJING,CHINA.  BEIJING    CHINA
 6 LABORATORY OF MOLECULAR BIOLOGY,ISTITUTO ORTOPEDICO,MILAN,ITALY.           <NA>       <NA>
 7 DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN                AGRI       TURKEY
 8 INSTITUTE OF LABORATORY MEDICINE,UNIVERSITY HOSPITAL,MUNICH,GERMANY.       <NA>       <NA>
 9 DEPARTMENT OF CLINICAL PATHOLOGY,MAHIDOL UNIVERSITY,BANGKOK,THAILAND.      BANGKOK    THAILAND
10 DEPARTMENT OF BIOLOGY,WASEDA UNIVERSITY,TOKYO,JAPAN                        <NA>       <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY,MINISTRY OF HEALTH,TEHRAN,IRAN.            <NA>       <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE,FUWAI HOSPITAL,CHINA.                 <NA>       <NA>

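Solution

Combining str_extract with a join (or with a second str_extract) is one option. str_extract returns the first match it finds in each string, and paste0 collapses the city names into one long alternation ("or") pattern to test against.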

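library(dplyr)
library(stringr)

# One alternation pattern with \\b on both sides of every city,
# so only whole words match (AGRI does not match AGRICULTURE).
city_pattern <- paste0("\\b", dfb$city, "\\b", collapse = "|")

dfa %>%
  mutate(city = str_extract(affiliation, city_pattern)) %>%  # first city found, else NA
  left_join(dfb, by = "city")                                # look up its country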

Edit: added word boundaries (\\b) in paste0 so that only whole city names match and partial matches are avoided.

   affiliation                                                                city       country
 1 DEPARTMENT OF PHARMACY,AMSTERDAM UNIVERSITY,AMSTERDAM,THE NETHERLANDS      AMSTERDAM  NETHERLANDS
 2 DEPARTMENT OF BIOCHEMISTRY,LADY HARDINGE MEDICAL COLLEGE,NEW DELHI,INDIA.  <NA>       <NA>
 3 DEPARTMENT OF PATHOLOGY,CHILDREN'S HOSPITAL,LOS ANGELES,UNITED STATES      <NA>       <NA>
 4 COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD.               BAGHDAD    IRAQ
 5 DEPARTMENT OF CLINICAL LABORATORY,BEIJING GENERAL HOSPITAL,BEIJING,CHINA.  BEIJING    CHINA
 6 LABORATORY OF MOLECULAR BIOLOGY,ISTITUTO ORTOPEDICO,MILAN,ITALY.           <NA>       <NA>
 7 DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN                BERLIN     GERMANY
 8 INSTITUTE OF LABORATORY MEDICINE,UNIVERSITY HOSPITAL,MUNICH,GERMANY.       <NA>       <NA>
 9 DEPARTMENT OF CLINICAL PATHOLOGY,MAHIDOL UNIVERSITY,BANGKOK,THAILAND.      BANGKOK    THAILAND
10 DEPARTMENT OF BIOLOGY,WASEDA UNIVERSITY,TOKYO,JAPAN                        <NA>       <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY,MINISTRY OF HEALTH,TEHRAN,IRAN.            <NA>       <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE,FUWAI HOSPITAL,CHINA.                 <NA>       <NA>
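
To see why the word boundaries matter, here is a minimal sketch that runs the same affiliation through a plain alternation and through a boundary-wrapped one. The variable names (cities, loose_pattern, strict_pattern) are made up for illustration, and the sketch assumes the city names contain no regex metacharacters such as "." or "(", which would need escaping first.

```r
library(stringr)

# Two cities from dfb; "AGRI" is also a fragment of "AGRICULTURE".
cities <- c("AGRI", "BERLIN")
affiliation <- "DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN"

# Plain alternation: any occurrence counts, even inside a longer word.
loose_pattern <- paste0(cities, collapse = "|")

# Boundary-wrapped alternation: each city must stand alone as a whole word.
strict_pattern <- paste0("\\b", cities, "\\b", collapse = "|")

# str_extract returns the leftmost match in the string.
loose_match <- str_extract(affiliation, loose_pattern)    # "AGRI" (inside AGRICULTURE)
strict_match <- str_extract(affiliation, strict_pattern)  # "BERLIN"

print(loose_match)
print(strict_match)
```

Because str_extract stops at the leftmost match, the loose pattern never reaches BERLIN in this row. The boundary version skips AGRICULTURE and picks up the first standalone city.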

This approach accounts for the possibility that an affiliation matches more than one city name.

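library(tidyverse)

dfa %>%
  # test every city against each affiliation; \\b keeps AGRI from matching AGRICULTURE
  mutate(city = map(affiliation, ~ str_extract(.x, paste0("\\b", dfb$city, "\\b")))) %>%
  unnest(cols = c(city)) %>%
  group_by(affiliation) %>%
  mutate(nmatches = sum(!is.na(city))) %>%
  # keep every matched city, or a single NA row when nothing matched
  filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
  ungroup() %>%
  left_join(dfb, by = "city") %>%
  # does the affiliation also name the matched city's country?
  mutate(country_match = str_detect(affiliation, country))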

# A tibble: 12 x 5
   affiliation              city   nmatches country country_match
   <chr>                    <chr>     <int> <chr>   <lgl>        
 1 DEPARTMENT OF PHARMACY,… AMSTE…        1 NETHER… TRUE         
 2 DEPARTMENT OF BIOCHEMIS… NA            0 NA      NA           
 3 DEPARTMENT OF PATHOLOGY… NA            0 NA      NA           
 4 COLLEGE OF EDUCATION FO… BAGHD…        1 IRAQ    FALSE        
 5 DEPARTMENT OF CLINICAL … BEIJI…        1 CHINA   TRUE         
 6 LABORATORY OF MOLECULAR… NA            0 NA      NA           
 7 DEPARTMENT OF AGRICULTU… BERLIN        1 GERMANY FALSE        
 8 INSTITUTE OF LABORATORY… NA            0 NA      NA           
 9 DEPARTMENT OF CLINICAL … BANGK…        1 THAILA… TRUE         
10 DEPARTMENT OF BIOLOGY,W… NA            0 NA      NA           
11 DEPARTMENT OF MOLECULAR… NA            0 NA      NA           
12 LABORATORY OF CARDIOVAS… NA            0 NA      NA           

You can then double-check the cases where nmatches is 1 and country_match == FALSE, and when an affiliation has 2 or more nmatches, keep the row with country_match == TRUE.
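
As a rough sketch of that follow-up step, the two checks can be written as filters. The data frame res below is hypothetical: it only mimics the column layout of the result above, and its second affiliation (containing both BERLIN and ATHENS) is invented to show a two-match case. The names to_review and resolved are also made up for illustration.

```r
library(dplyr)

# Hypothetical result with the same columns as the pipeline output.
res <- data.frame(
  affiliation = c(
    "COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD.",
    "BERLIN SCHOOL OF MEDICINE,ATHENS,GREECE",   # invented two-city example
    "BERLIN SCHOOL OF MEDICINE,ATHENS,GREECE",
    "DEPARTMENT OF BIOLOGY,WASEDA UNIVERSITY,TOKYO,JAPAN"
  ),
  city = c("BAGHDAD", "ATHENS", "BERLIN", NA),
  nmatches = c(1L, 2L, 2L, 0L),
  country = c("IRAQ", "GREECE", "GERMANY", NA),
  country_match = c(FALSE, TRUE, FALSE, NA)
)

# Single-city rows whose country is not named in the affiliation: review by hand.
to_review <- res %>%
  filter(nmatches == 1, !country_match)

# Multi-city rows: keep the city whose country is also named in the affiliation.
resolved <- res %>%
  filter(nmatches >= 2, country_match)

print(to_review)
print(resolved)
```

Rows with no match have country_match equal to NA, and filter drops them from both subsets, so they stay out of the review list. If none of the cities in a multi-match affiliation has its country named, this filter returns nothing for it, and that affiliation would still need a manual decision.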