How do I compare two data frames row by row?
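About a month ago I posted my original question: I needed to compare two data frames row by row and flag the rows in df2 (the second file) that do not match df1 (the first file). The solution was to use an anti-join. That worked well until I added an extra column containing text strings. I need to include that column in the comparison too, and detect which records have a text string that does not match.
Sample data frames are shown below. I need to compare df2 against df1 and show which rows in df2 do not match df1. I can use an anti-join in R to show which rows do not match, but it does not work when the rows contain text strings.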
df1
Product basecode A B C D E F
Tractor A810 382 512 363 553 530 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A dog ran fast
Combine B681 670 626 572 795 323 A dog ran fast
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Sprayer C292 256 226 257 770 633 A dog ran fast
SPFH D126 323 597 647 159 317 A dog ran fast
SPFH D307 711 535 323 793 769 A dog ran fast
SPFH D355 155 744 772 689 509 A dog ran fast
SPFH D893 155 744 772 689 509 A dog ran fast
df2
Product basecode A B C D E F
Tractor A810 382 512 363 553 530 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B 244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A dog ran fast
Combine B681 670 626 572 795 323 A dog ran fast
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
Sprayer C292 1 1 1 1 1 A dog ran fast
SPFH D126 323 597 647 159 317 A dog ran fast
SPFH D307 711 535 323 793 769 A dog ran fast
SPFH D355 155 744 772 689 509 A dog ran fast
SPFH D893 1 1 1 1 1 A dog ran fast
Tractor A810 491 765 457 249 641 A dog ran fast
Tractor A773 222 155 650 278 215 A dog ran fast
Tractor A203 382 512 363 553 530 A dog ran fast
Tractor A329 332 459 251 341 475 A dog ran fast
Combine B 244 244 714 467 122 340 A dog ran fast
Combine B302 257 758 230 704 715 A cat ran slow
Combine B681 670 626 572 795 323 cat
Combine B514 768 510 546 542 582 A dog ran fast
Sprayer C850 553 624 557 660 337 A dog ran fast
Sprayer C202 561 733 443 107 526 A dog ran fast
Sprayer C619 256 226 257 770 633 A dog ran fast
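library(dplyr)
# add id to identify which rows are not matching
df2 <- df2 %>% mutate(id = basecode)
# rows of df2 with no exact match in df1 (joined on all shared columns, including F)
df_unmatch <- anti_join(df2,df1)
# list of non-match are the ids of df_unmatch
df_unmatch$id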
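One thing worth checking is which columns the join actually compares. The following is a minimal sketch on small made-up tables (`df1_toy` and `df2_toy` are hypothetical, modeled on the data above, not the real files). Passing `by` explicitly makes `anti_join` compare the text column `F` together with the other columns.

```r
library(dplyr)

# Toy data (hypothetical), same shape as a subset of the tables above
df1_toy <- tibble(
  basecode = c("A810", "B302", "B681"),
  A = c(382, 257, 670),
  F = c("A dog ran fast", "A dog ran fast", "A dog ran fast")
)
df2_toy <- tibble(
  basecode = c("A810", "B302", "B681"),
  A = c(382, 257, 670),
  F = c("A dog ran fast", "A cat ran slow", "cat")
)

# Rows of df2_toy that have no exact counterpart in df1_toy,
# comparing every column listed in `by`, including the text column F
unmatched_toy <- anti_join(df2_toy, df1_toy, by = c("basecode", "A", "F"))
unmatched_toy$basecode
# [1] "B302" "B681"
```

Character columns are compared exactly, so differences in case or stray whitespace also count as a mismatch.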
Data
#structure(list(Product = c("Tractor","Tractor","Combine","Sprayer","SPFH","Sprayer"),basecode = c("A810","A773","A203","A329","B 244","B302","B681","B514","C850","C202","C619","C292","D126","D307","D355","D893","A810","C619"),A = c(382,222,382,332,244,257,670,768,553,561,256,1,323,711,155,491,256),B = c(512,512,459,714,758,626,510,624,733,226,597,535,744,765,226),C = c(363,650,363,251,467,230,572,546,557,443,647,772,457,257),D = c(553,278,341,122,704,795,542,660,107,770,159,793,689,249,770
),E = c(530,215,530,475,340,715,582,337,526,633,317,769,509,641,633),F = c("A dog ran fast","A dog ran fast","cat","A dog ran fast"
)),row.names = c(NA,-27L),class = c("tbl_df","tbl","data.frame"
))
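Solution
It does work, unless you have some particular expectation (see Limey's comment). The two files you provided are actually identical (see MonJeanJean's comment), so let's start by creating some non-matching rows: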
df1$F <- "A dog ran faster" ## df2 has "cat" somewhere
df2$A[16] <- 155
anti_join(df2,df1)
# A tibble: 2 x 8
Product basecode A B C D E F
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 SPFH D893 1 1 1 1 1 A dog ran fast
2 Combine B681 670 626 572 795 323 cat
What result did you expect?
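If the expected result is a list of records whose text differs while the key is the same, one possible approach is to join on the key and compare the text column directly. This is only a sketch under an assumption that is not guaranteed by the question: that `basecode` identifies a row uniquely. The tables `left_toy` and `right_toy` and the suffixes `_df1` / `_df2` are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical); basecode is assumed to be a unique key here
left_toy <- tibble(
  basecode = c("A810", "B302", "B681"),
  F = c("A dog ran fast", "A dog ran fast", "A dog ran fast")
)
right_toy <- tibble(
  basecode = c("A810", "B302", "B681"),
  F = c("A dog ran fast", "A cat ran slow", "cat")
)

# Pair rows by key, keep both versions of F side by side,
# then keep only the pairs where the text differs
text_diff <- inner_join(left_toy, right_toy, by = "basecode",
                        suffix = c("_df1", "_df2")) %>%
  filter(F_df1 != F_df2)
text_diff$basecode
# [1] "B302" "B681"
```

In the df2 posted in the question some basecode values appear more than once (for example A810), so a different or composite key would be needed there.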
Not the most readable solution, but it may be useful if you have a lot of rows. It gives you both the matching and the non-matching rows.
library(data.table)
cols <- c("Product","basecode","A","B","C","D","E","F")
# full outer join; each side carries a source tag and its original row number
both <- merge(as.data.table(df1)[,c(.SD,.(source = "df1",id1 = 1:nrow(df1)))],as.data.table(df2)[,c(.SD,.(source = "df2",id2 = 1:nrow(df2)))],by = cols,all = TRUE)
match <- both[!is.na(source.x) & !is.na(source.y)]   # rows present in both tables
unmatch <- both[is.na(source.x) | is.na(source.y)]   # rows present in only one table
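If only the rows of the second table that are missing from the first are needed, data.table's anti-join syntax `X[!Y, on = ...]` may be a shorter route than the full merge. A minimal sketch on made-up tables (`dt1_toy` and `dt2_toy` are hypothetical):

```r
library(data.table)

# Toy data (hypothetical), same idea as the tables in the question
dt1_toy <- data.table(
  basecode = c("A810", "B681"),
  A = c(382, 670),
  F = c("A dog ran fast", "A dog ran fast")
)
dt2_toy <- data.table(
  basecode = c("A810", "B681"),
  A = c(382, 670),
  F = c("A dog ran fast", "cat")
)

# Columns that must all be equal for two rows to count as a match
join_cols <- c("basecode", "A", "F")

# Anti-join: rows of dt2_toy that have no match in dt1_toy on join_cols
only_in_dt2 <- dt2_toy[!dt1_toy, on = join_cols]
only_in_dt2$basecode
# [1] "B681"
```

Unlike the merge above, this returns only one side of the difference, so it does not report rows that exist in the first table but not in the second.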