Writing code/a function that matches column names by highest similarity


I have five datasets that cover the same topic over time.

library(data.table)
DT <- data.table(A = round(rnorm(10,10,10),2), B = round(rnorm(10,10,10),2), C = round(rnorm(10,10,10),2))
DT_2 <- data.table(A = round(rnorm(10,10,10),2), B = round(rnorm(10,10,10),2), C = round(rnorm(10,10,10),2), D = round(rnorm(10,10,10),2))
DT_3 <- DT
DT_4 <- DT_2
DT_5 <- DT_2
names(DT)   <- c("something","nothing","anything")
names(DT_2) <- c("some thing","no thing","any thing","number4")
names(DT_3) <- c("some thing wrong","nothing","anything_")
names(DT_4) <- c("something","nothingg","anything","number_4")
names(DT_5) <- c("something","nothing","anything happening","number4")

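However, each year differs a little: some column names change slightly, some columns are added, and some are removed. I would like to "bind" these datasets together, but each one has about 100 columns, so making all the column names consistent by hand would be tedious.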

EDIT: Note that the columns do not necessarily share the same index, as in the edited column names below, where DT_2 has a column XXX.

# EDIT
names(DT)   <- c("something","nothing","anything")
names(DT_2) <- c("some thing","no thing","XXX","number4")
names(DT_3) <- c("some thing wrong","nothing","anything_")
names(DT_4) <- c("something","nothingg","anything","number_4")
names(DT_5) <- c("something","nothing","anything happening","number4")

I think writing a function to do this for me would be a better idea.

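I previously asked for help here with a function that does something similar. The following function merges columns whose names differ only in capitalization, without the variable names having to be specified.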

It is very neat, and it also reports which variable names were merged.

library(data.table)
library(magrittr) # piping is used to improve readability
names(DT_panel) %>% 
  data.table(orig = .,lc = tolower(.)) %>% 
  .[,{
    if (.N > 1L) {
      new <- toupper(.BY)
      old <- setdiff(orig,new)
      DT_panel[,(new) := fcoalesce(.SD),.SDcols = orig]
      DT_panel[,(old) := NULL]
      sprintf("Coalesced %s onto %s",toString(old),new)
    }
  },by = lc]

In addition, I found this question here, which performs a fuzzy join based on the column entries.

library(fuzzyjoin); library(dplyr);

stringdist_join(a,b,by = "name",mode = "left",ignore_case = FALSE,method = "jw",max_dist = 99,distance_col = "dist") %>%
  group_by(name.x) %>%
  top_n(1,-dist)

The problem is that I do not understand either solution well enough to combine them into one function that does what I need.

Could someone help me get started? My desired output is as follows:

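DT <- data.table(A = round(rnorm(10,10,10),2), B = round(rnorm(10,10,10),2), C = round(rnorm(10,10,10),2))
DT_2 <- data.table(A = round(rnorm(10,10,10),2), B = round(rnorm(10,10,10),2), C = round(rnorm(10,10,10),2), D = round(rnorm(10,10,10),2))
D <- rep(NA_real_, 10)  # placeholder for the column missing from DT and DT_3
DT_3 <- DT
DT_4 <- DT_2
DT_5 <- DT_2
DT <- cbind(DT,D)
DT_3 <- cbind(DT_3,D)
DT <- rbind(DT,DT_2,DT_3,DT_4,DT_5)
names(DT) <- c("something","nothing","anything","number4")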

Solution

This approach is based on fuzzyjoin::stringdist_join. It can handle both new and removed columns.

Start with some dummy data.

library(tidyverse)

df1 <- tibble("something" = 1,"nothing" = 2,"anything" = 3,"number4" = 4)
df2 <- tibble("some thing" = 1,"no thing" = 2,"XXX" = 99,"number4" = 4)
df3 <- tibble("some thing wrong" = 1,"nothing" = 2,"anything_" = 4)
df4 <- tibble("something" = 1,"nothingg" = 2,"anything" = 2,"number_4" = 4,"YYY" = 100)
df5 <- tibble("something" = 1,"nothing" = 2,"anything happening" = 2,"number4" = 4)

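fuzzy_rowbind fuzzily row-binds two data frames. It uses fuzzyjoin::stringdist_join to identify the most similar column names, pairing them greedily from the closest match onward so that each name is used at most once. The matched columns of the second data frame are then renamed, and the two data frames are bound together by row.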

fuzzy_rowbind <- function(a,b,method = "cosine",max_dist = 0.9999) {
  a_name_df <- tibble(name = names(a))
  b_name_df <- tibble(name = names(b))
  
  fj <- 
    fuzzyjoin::stringdist_join(
      a_name_df,b_name_df,by = "name",mode = "left",ignore_case = FALSE,method = method,max_dist = max_dist,distance_col = "dist"
    ) %>%
    arrange(dist)
  
  name_mapping <- NULL
  while (nrow(fj) > 0 && !all(b_name_df$name %in% name_mapping$name.y)) {
    name_mapping <- bind_rows(name_mapping,fj %>% slice(1))
    
    fj <- fj %>% filter(!name.x %in% name_mapping$name.x,!name.y %in% name_mapping$name.y)
  }
  
  new_names <- setNames(name_mapping$name.y,name_mapping$name.x)
  
  # all_of() makes it explicit that the names come from an external vector
  b_renamed <- rename(b,all_of(new_names[!is.na(new_names)]))
  
  enframe(new_names,name = "new_name",value = "original_name") %>%
    filter(new_name != original_name,!is.na(new_name)) %>%
    as.data.frame() %>%
    print()
  cat("\n")
  
  bind_rows(a,b_renamed)
}
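
The while loop above performs a greedy one-to-one matching: it repeatedly takes the closest remaining pair of names and then discards every other candidate that reuses either name. The sketch below isolates that idea in base R so it can be tried without any packages. It is only an illustration under stated assumptions: greedy_match is a hypothetical helper that is not part of the answer, and it swaps in plain edit distance from utils::adist in place of the cosine distance used by fuzzy_rowbind.

```r
# Hypothetical helper (not part of the answer above): greedy one-to-one
# matching of two sets of column names. Assumption: base R edit distance
# (utils::adist) stands in for the cosine distance used by fuzzy_rowbind.
greedy_match <- function(a_names, b_names, max_dist = Inf) {
  d <- utils::adist(a_names, b_names)   # rows: a_names, columns: b_names
  pairs <- data.frame(
    a = a_names[row(d)],
    b = b_names[col(d)],
    dist = as.vector(d),
    stringsAsFactors = FALSE
  )
  pairs <- pairs[pairs$dist <= max_dist, ]  # drop candidates that are too far apart
  pairs <- pairs[order(pairs$dist), ]       # closest candidates first
  matched <- pairs[0, ]
  while (nrow(pairs) > 0) {
    best <- pairs[1, ]                      # closest remaining pair
    matched <- rbind(matched, best)
    # remove every candidate that reuses either of the two matched names
    pairs <- pairs[pairs$a != best$a & pairs$b != best$b, ]
  }
  matched
}

res <- greedy_match(
  c("something", "nothing", "anything", "number4"),
  c("some thing", "no thing", "XXX", "number4"),
  max_dist = 2
)
res
```

With the threshold in place, "anything" and "XXX" stay unmatched, which is the behaviour wanted for columns that were added or removed. Without a threshold, the greedy step would eventually pair them simply because nothing else is left.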

For example, combining df1 and df2 with fuzzy_rowbind gives the following.

fuzzy_rowbind(df1,df2)
#>    new_name original_name
#> 1 something    some thing
#> 2   nothing      no thing
#> 
#> # A tibble: 2 x 5
#>   something nothing anything number4   XXX
#>       <dbl>   <dbl>    <dbl>   <dbl> <dbl>
#> 1         1       2        3       4    NA
#> 2         1       2       NA       4    99
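
To see why these pairs were chosen, it can help to look at the distances directly. The sketch below assumes the stringdist package is available (fuzzyjoin relies on it for string distances) and computes the cosine distance between the column names of df1 and df2. The variable names a_names, b_names and d are only illustrative.

```r
library(stringdist)

a_names <- c("something", "nothing", "anything", "number4")
b_names <- c("some thing", "no thing", "XXX", "number4")

# Cosine distance between the character (q = 1) profiles of each pair of names;
# identical names have distance 0, names sharing no characters have distance 1
d <- stringdistmatrix(a_names, b_names, method = "cosine")
dimnames(d) <- list(a_names, b_names)
round(d, 3)
```

Names that share no characters at all, such as "anything" and "XXX", are at distance 1, so the default max_dist = 0.9999 in fuzzy_rowbind excludes exactly those pairs. A stricter threshold may be safer for real data with around 100 columns, where unrelated names often share a few letters.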

Next, define fuzzy_rowbind_all, which takes a list of data frames and combines them all.

fuzzy_rowbind_all <- function(l) {
  last(accumulate(l,fuzzy_rowbind))
}
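
The pattern last(accumulate(l, f)) builds every intermediate result and then keeps only the final one, which is a left fold; purrr::reduce(l, f) would return that final value directly. The base R sketch below shows the equivalence with a toy stand-in for fuzzy_rowbind. The function combine is hypothetical and only pastes labels together so that the order of combination is visible.

```r
# Toy stand-in for fuzzy_rowbind (assumption: any two-argument combiner works)
combine <- function(acc, x) paste0(acc, "+", x)

inputs <- c("df1", "df2", "df3")

# Every intermediate result, as accumulate() would produce them
steps <- Reduce(combine, inputs, accumulate = TRUE)
steps

# Only the final result, as reduce() would produce it
final <- Reduce(combine, inputs)
identical(final, steps[[length(steps)]])
```

Because the fold runs left to right, the column names of the first data frame act as the reference names for everything that follows, so it is worth putting the dataset with the cleanest names first.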

Here fuzzy_rowbind_all is applied to our data frames.

fuzzy_rowbind_all(
  lst(df1,df2,df3,df4,df5)
)
#>    new_name original_name
#> 1 something    some thing
#> 2   nothing      no thing
#> 
#>    new_name    original_name
#> 1  anything        anything_
#> 2 something some thing wrong
#> 
#>   new_name original_name
#> 1  nothing      nothingg
#> 2  number4      number_4
#> 
#>   new_name      original_name
#> 1 anything anything happening
#> 
#> # A tibble: 5 x 6
#>   something nothing anything number4   XXX   YYY
#>       <dbl>   <dbl>    <dbl>   <dbl> <dbl> <dbl>
#> 1         1       2        3       4    NA    NA
#> 2         1       2       NA       4    99    NA
#> 3         1       2        4      NA    NA    NA
#> 4         1       2        2       4    NA   100
#> 5         1       2        2       4    NA    NA
