spec_tbl_df is more than 10x slower than a regular tibble on the same operations

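So I was really wondering why two different R sessions using the same data would take drastically different times to complete the same task. After restarting R several times, clearing all variables, and running a truly clean R session, I found the problem: the new data structure provided by vroom and readr is, for some reason, very slow in my script. Of course, the simplest way around this is to convert the data to a tibble immediately after loading it. Or is there another explanation, such as poor coding practice in my functions, that could account for the slow behaviour? Or is this a bug from a recent update of these packages? If so, and if someone is more experienced with reporting bugs to the tidyverse, here is a reprex showing the behaviour, because I feel this is beyond my scope.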

#Load packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter,lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect,setdiff,setequal,union
library(purrr)
library(vroom)
library(tidyr)
library(microbenchmark)
#Generate some dummy data
ex_data <- tibble(
  sd = 1,
  mean = 1:1000,
  a1 = rnorm(1000,mean,sd),
  a2 = rnorm(1000,mean,sd),
  a3 = rnorm(1000,mean,sd)
  ) %>% 
  mutate(
    a1 = if_else(a1<mean,NA_real_,a1),
    a2 = if_else(a2<mean,NA_real_,a2),
    a3 = if_else(a3<mean,NA_real_,a3)
  )
#Wrapper function exhibiting the behaviour
impute_row <- function(mean,sd,data){
  if(!anyNA(data)){
    return(data)
  }else{
    data <- as.data.frame(data)
    data[is.na(data)] <-  rnorm(n = sum(is.na(data)),mean = mean,sd = sd)
    return(data)
  }
}
#Main function
imputer <- function(data){
  data %>% 
    mutate(
      data = pmap(list(mean,sd,data),impute_row)
    ) %>% 
    unnest(cols = data)
}
#Generate dummy file
out_file <- tempfile(fileext = "csv")
vroom_write(ex_data,out_file,",")
#Read it in
ex_data_spc <- vroom(out_file,col_types = cols()) %>% 
  nest(data = -c(mean,sd))
#nest the original data as well
ex_data <- ex_data %>% 
  nest(data = -c(mean,sd))
#Benchmark
microbenchmark(
  tib = imputer(ex_data),spc_tib = imputer(ex_data_spc),times = 10
)
#> Unit: milliseconds
#>     expr        min         lq       mean     median        uq       max neval
#>      tib   82.81192   87.45288   89.19118   90.47263   91.2216   93.4418    10
#>  spc_tib 1041.90378 1070.00579 1244.97090 1076.92022 1093.0054 2780.0722    10

Created on 2021-06-14 by the reprex package (v2.0.0)

In the worst case, it is almost 30 times slower than running on a tibble.

Solution

This is the issue I had in mind. These problems are known to happen with vroom, not with the spec_tbl_df class, which does not actually do much.

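vroom does all sorts of things to try to speed up reading; AFAIK mostly through lazy reading. That is why you see all those differing components when comparing the two datasets.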
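If lazy reading is the suspect, one way to probe it is to ask vroom to read the columns eagerly. This is a minimal sketch, assuming the installed vroom version supports the `altrep` argument of `vroom()`. The toy data and the names `tmp`, `lazy`, and `eager` are made up for illustration, and whether this removes the slowdown in the reprex has not been verified here.

```r
library(vroom)

# Hypothetical toy file, only for illustration
tmp <- tempfile(fileext = ".csv")
vroom_write(data.frame(g = c(1, 1, 2), x = c(0.5, NA, 2.5)), tmp, delim = ",")

# Default call: vroom may return lazily evaluated (ALTREP) columns
lazy <- vroom(tmp, delim = ",", col_types = cols())

# altrep = FALSE asks vroom not to use ALTREP vectors for any column type
eager <- vroom(tmp, delim = ",", col_types = cols(), altrep = FALSE)

# Both objects should hold the same values; only the storage may differ
all.equal(lazy$x, eager$x)
```

Rerunning the benchmark with the eagerly read object would show whether laziness accounts for the difference on a given setup.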

With vroom:

~~~(snip)~~~
ex_data_spc <- vroom(out_file,col_types = cols()) %>% 
  nest(data = -c(mean,sd))
~~~(snip)~~~

#> Unit: milliseconds
#>     expr       min        lq     mean    median        uq       max neval cld
#>  spc_tib 1679.2088 1704.3085 2106.864 1731.6694 1942.9444 4918.4498    10   b
#>      tib  149.8716  158.8548  169.489  170.3735  182.5681  192.8533    10  a

all.equal(ex_data,ex_data_spc)
#>    [1] "Component \"data\": Component 1: Attributes: < Names: 1 string mismatch >"                                                 
#>    [2] "Component \"data\": Component 1: Attributes: < Length mismatch: comparison on first 2 components >"                        
#>    [3] "Component \"data\": Component 1: Attributes: < Component \"class\": Lengths (3,4) differ (string compare on first 3) >"   
#>    [4] "Component \"data\": Component 1: Attributes: < Component \"class\": 3 string mismatches >"                                 
#>    [5] "Component \"data\": Component 1: Attributes: < Component 2: Modes: numeric,externalptr >"  
                               
~~~(snip)~~~
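The mismatches reported by `all.equal()` above concern the class vector and the attributes of the nested tibbles. To see them directly, something like the following could be used. It is a sketch on made-up toy data (`toy`, `from_vroom`, `from_memory`, and `extra_attrs` are illustrative names), and the exact classes and attributes printed may depend on the installed vroom and tidyr versions.

```r
library(vroom)
library(tidyr)

# Hypothetical toy data, written to a temporary CSV
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(g = c(1, 1, 2), x = c(0.5, NA, 2.5))
vroom_write(toy, tmp, delim = ",")

# Same data, once read through vroom and once built in memory
from_vroom  <- nest(vroom(tmp, delim = ",", col_types = cols()), data = -g)
from_memory <- nest(tibble::as_tibble(toy), data = -g)

# Class vectors of the first nested element in each version
class(from_vroom$data[[1]])
class(from_memory$data[[1]])

# Attribute names present only on the vroom-derived element
extra_attrs <- setdiff(
  names(attributes(from_vroom$data[[1]])),
  names(attributes(from_memory$data[[1]]))
)
extra_attrs
```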

With readr:

~~~(snip)~~~
ex_data_spc <- readr::read_csv(out_file,col_types = cols()) %>% 
  nest(data = -c(mean,sd))
~~~(snip)~~~
#> Unit: milliseconds
#>     expr      min       lq     mean   median       uq      max neval cld
#>  spc_tib 148.9432 161.7315 181.2137 184.4592 191.9048 219.7883    10   a
#>      tib 161.9441 166.7826 175.3644 175.3354 181.4598 197.5544    10   a

all.equal(ex_data,ex_data_spc)
#> [1] TRUE

If you like, you can post your reprex on that issue.
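In the meantime, the workaround mentioned in the question (turning the result into a plain tibble right after loading) could look roughly like this. `strip_spec()` is a hypothetical helper, not part of vroom or readr, and the attribute names `"spec"` and `"problems"` are assumptions about what the reader attaches. Whether this alone removes the slowdown has not been verified here, so it is worth benchmarking.

```r
library(vroom)

# Hypothetical toy file, only for illustration
tmp <- tempfile(fileext = ".csv")
vroom_write(data.frame(g = c(1, 1, 2), x = c(0.5, NA, 2.5)), tmp, delim = ",")

# Hypothetical helper: drop the reader-specific class and attributes
strip_spec <- function(x) {
  attr(x, "spec") <- NULL       # assumed attribute name
  attr(x, "problems") <- NULL   # assumed attribute name
  class(x) <- setdiff(class(x), "spec_tbl_df")
  x
}

plain <- strip_spec(vroom(tmp, delim = ",", col_types = cols()))
class(plain)
```

The helper only uses base R, so it leaves the column data untouched and changes nothing but the metadata on the data frame itself.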