combn error: combn(unique(x), 2, paste, collapse = " and "): n < m

How to resolve the error in combn(unique(x), 2, paste, collapse = " and "): n < m

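I have a data frame with 185,686 rows and 11 variables, but I am only interested in two of them: Order.ID and Product.

Each row of the original data frame holds a unique combination of ID, product, quantity, address and so on. From this df I created a new one containing only the IDs and products of orders in which more than one product was bought.

I am trying to find out which products are often sold together. I have made sure the original data frame has no identical rows or empty rows. Everything looks fine, except that R says Product has 21 levels, two of which are wrong, so the data frame really has only 19 products. But if I type nlevels(venda.id$Product) I still get 21.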
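A note on the level count: nlevels() reports every level declared on the factor, including levels that no row uses any more, so 21 versus 19 is expected after filtering. A minimal sketch with made-up data (the factor f and its two stray levels are hypothetical, not taken from the real data frame):

```r
# Toy factor: two levels ("Product" and "") are declared but never used
f <- factor(c("iPhone", "Google Phone", "iPhone"),
            levels = c("Google Phone", "iPhone", "Product", ""))

n_declared <- nlevels(f)          # counts all declared levels
n_used     <- length(unique(f))   # counts values that actually occur

f_clean    <- droplevels(f)       # drops the unused levels
n_after    <- nlevels(f_clean)
```

On a data frame, droplevels(venda.id) would do the same for every factor column at once.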

  Order.ID  Product
1 176560    Google Phone
2 176560    Wired Headphones
3 176574    Google Phone
4 176574    USB-C Charging Cable
5 176586    AAA Batteries (4-pack)
6 176586    Google Phone
7 176672    Lightning Charging Cable
8 176672    USB-C Charging Cable
9 176681    Apple AirPods Headphones
10 176681   ThinkPad Laptop
11 176689   Bose SoundSport Headphones
12 176689   AAA Batteries (4-pack)
13 176739   34in Ultrawide Monitor
14 176739   Google Phone
15 176774   Lightning Charging Cable
16 176774   USB-C Charging Cable
17 176781   iPhone
18 176781   Lightning Charging Cable

structure(list(Order.ID = structure(c(1L,1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,10L),.Label = c("176560","176574","176586","176672","176681","176689","176739","176774","176781","176797"),class = "factor"),Product = structure(c(5L,4L),.Label = c("34in Ultrawide Monitor","AAA Batteries (4-pack)","Apple AirPods Headphones","Bose SoundSport Headphones","Google Phone","iPhone","Lightning Charging Cable","ThinkPad Laptop","USB-C Charging Cable","Wired Headphones"),class = "factor")),row.names = c(NA,20L
),class = "data.frame")

The problem occurs when I try to get the top 2 combinations:

tail(sort(table(unlist(tapply(as.character(venda.id$Product),venda.id$Order.ID,FUN=function(x) combn(unique(x),2,paste,collapse=" and "))))),2)

Error in combn(unique(x), 2, paste, collapse = " and ") : n < m

The code should produce something like this (I don't know what the actual answer is):

Lightning Charging Cable and iPhone Wired Headphones and USB-C Charging 
                x                                y

where x and y are the frequencies computed by table.

If I don't apply as.character to the Product column, I get a different error:

Error in class(out) <- class(x0) : adding class "factor" to an invalid object

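I tried alternative code, but got the same error.

The first time I ran it, it worked, but the result looked wrong, because the counts were as low as 16 while the data has 14,128 rows. Now it no longer runs at all.

Does anyone know how to fix this?

Update: I found that the error occurs at rows 783 and 784, where the same product appears twice for the same ID, even though this does not happen in the original data.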
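One way to locate such orders is to count the distinct products per Order.ID and list the IDs with fewer than two. This is only a sketch on made-up data; the data frame orders below is hypothetical and stands in for venda.id:

```r
# Toy stand-in for venda.id: order "A" repeats one product,
# order "C" has a single row, order "B" is a valid pair
orders <- data.frame(
  Order.ID = c("A", "A", "B", "B", "C"),
  Product  = c("iPhone", "iPhone", "Google Phone", "Wired Headphones", "iPhone"),
  stringsAsFactors = FALSE
)

# Number of distinct products in each order
distinct_per_order <- tapply(orders$Product, orders$Order.ID,
                             function(x) length(unique(x)))

# Orders from which no pair can be formed
bad_ids  <- names(distinct_per_order)[distinct_per_order < 2]
bad_rows <- orders[orders$Order.ID %in% bad_ids, ]
```

Any ID that shows up in bad_ids would make combn(unique(x), 2, ...) fail with n < m, because unique(x) has fewer than 2 elements for that group.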

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xts_0.12.1        zoo_1.8-9         lubridate_1.7.10  viridis_0.5.1    
 [5] viridisLite_0.3.0 hrbrthemes_0.8.0  forcats_0.5.1     stringr_1.4.0    
 [9] purrr_0.3.4       readr_1.4.0       tidyr_1.1.3       tibble_3.0.6     
[13] tidyverse_1.3.0   dygraphs_1.1.1.6  ggplot2_3.3.3     dplyr_1.0.5      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        lattice_0.20-41   assertthat_0.2.1  digest_0.6.27    
 [5] utf8_1.1.4        R6_2.5.0          cellranger_1.1.0  backports_1.2.1  
 [9] reprex_1.0.0      evaluate_0.14     httr_1.4.2        pillar_1.5.0     
[13] gdtools_0.2.3     rlang_0.4.10      readxl_1.3.1      rstudioapi_0.13  
[17] extrafontdb_1.0   rmarkdown_2.7     labeling_0.4.2    extrafont_0.17   
[21] htmlwidgets_1.5.3 munsell_0.5.0     tinytex_0.30      broom_0.7.5      
[25] compiler_4.0.4    modelr_0.1.8      xfun_0.21         systemfonts_1.0.1
[29] pkgconfig_2.0.3   htmltools_0.5.1.1 tidyselect_1.1.0  gridExtra_2.3    
[33] fansi_0.4.2       crayon_1.4.1      dbplyr_2.1.0      withr_2.4.1      
[37] grid_4.0.4        jsonlite_1.7.2    Rttf2pt1_1.3.8    gtable_0.3.0     
[41] lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1    scales_1.1.1     
[45] cli_2.3.1         stringi_1.5.3     farver_2.1.0      fs_1.5.0         
[49] xml2_1.3.2        ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6      
[53] tools_4.0.4       glue_1.4.2        hms_1.0.0         yaml_2.2.1       
[57] colorspace_2.0-0  rvest_1.0.0       knitr_1.31        haven_2.3.1

Solution

Try changing the column types of the original data frame.

venda.id <- type.convert(venda.id,as.is = TRUE)

Run the code again after doing this.

So the problem appears when the code reaches row 783. I can make it work by doing this:

test=venda.id[784:14034,]

tail(sort(table(unlist(tapply(as.character(test$Product),test$Order.ID,FUN=function(x) combn(unique(x),2,paste,collapse=" and "))))),2)

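If I run test=venda.id[1:782,] it also works.

But if row 783 falls in the middle of the range, it does not work. I guess this is because rows 782 and 783 have the same product but different IDs, although that also happens elsewhere in the data without causing an error.

We can also use type_convert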
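combn(v, 2) raises n < m whenever v has fewer than 2 elements, which fits the update in the question: an order whose rows all carry the same product leaves a single value after unique(x). A possible workaround, sketched on made-up data, is to skip such groups inside the function passed to tapply. The helper pair_up and the data frame orders are hypothetical names, not part of the original code:

```r
# Toy data: order "C" contains one product repeated twice
orders <- data.frame(
  Order.ID = c("A", "A", "B", "B", "C", "C"),
  Product  = c("iPhone", "Lightning Charging Cable",
               "iPhone", "Lightning Charging Cable",
               "Google Phone", "Google Phone"),
  stringsAsFactors = FALSE
)

# Return all product pairs of one order, or NULL if no pair exists
pair_up <- function(x) {
  u <- sort(unique(x))                 # sort so each pair is always named the same way
  if (length(u) < 2) return(NULL)      # guard: combn would fail with n < m here
  combn(u, 2, paste, collapse = " and ")
}

pairs       <- unlist(tapply(orders$Product, orders$Order.ID, pair_up))
pair_counts <- sort(table(pairs))
top_pairs   <- tail(pair_counts, 2)
```

With the real data this would become tapply(as.character(venda.id$Product), venda.id$Order.ID, pair_up). Orders with a single distinct product contribute nothing to the table instead of stopping the whole computation.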

library(readr)
vend.id <- type_convert(venda.id)