Why does an apply function slow down a data.table R script? (with example)

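I am using stl time-series decomposition to identify outliers in 150+ million rows, grouped by location. Because it goes through lapply plus a user-defined function, it seems very slow. I have included a reproducible example that runs the script. Are there any suggestions for improving its performance, or is there anything I am missing?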

Sample data table

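library(data.table)

# 52 hourly rows for one location: two full days plus four hours of a third day
dt1 = data.table(location_id = rep("abc", 52),
                 report_date = c(rep("2020-04-22", 24), rep("2020-04-23", 24), rep("2020-04-24", 4)),
                 hour = c(rep(c(0:23), 2), 1, 2, 3, 4),
                 hr_visitors = c(20:67, 345, 236, 123, 67))

dt2 = dt1[, date_hour := as.POSIXct(paste(report_date, hour), format = "%Y-%m-%d %H")]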

This is the part of the script that takes the most time:

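stl_model_outlier <- function(xyz) {
  # Treat the values as an hourly series with a 24-hour season
  model <- ts(xyz, frequency = 24)
  f <- stl(model, "periodic", robust = TRUE)
  # Keep the positions whose robustness weight is effectively zero
  result <- data.table(row_id = which(f$weights < 1e-8))[, outlier := "Yes"]
  return(list(result))
}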

Sys.time()
"2020-09-30 06:41:51 GMT"

results1<-dt2[,lapply(.SD,stl_model_outlier),by = location_id,.SDcols = "hr_visitors"]

Sys.time()
"2020-09-30 06:54:51 GMT"

Solution

Try this:

stl_model_outlier <- function(x) {
  x.ts <- ts(x,frequency = 24)
  weights <- stl(x.ts,"periodic",robust = TRUE)$weights
  fifelse(weights < 1e-8,"Yes","No")
}

dt1[,outlier := stl_model_outlier(hr_visitors),by = location_id]

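fifelse is a fast version of ifelse.
Since the function is applied to only one column, lapply and .SD are not needed.
You also don't need the line dt2=dt1[,date_hour := .... data.table works by reference; check what happens to dt1 after running that line.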
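
The last two notes can be checked on a toy table. This is only an illustrative sketch: `toy`, `toy2`, `g`, `v` and the other names are made up for the example, and `cumsum` stands in for the stl-based function.

```r
library(data.table)

# Hypothetical toy data: two groups, one numeric column.
toy <- data.table(g = c("a", "a", "b", "b"), v = c(1, 2, 3, 4))

# For a single column, lapply(.SD, f) and calling f on the column directly
# give the same values.
via_sd <- toy[, lapply(.SD, cumsum), by = g, .SDcols = "v"]
direct <- toy[, .(v = cumsum(v)), by = g]
same_values <- identical(via_sd$v, direct$v)

# `:=` adds the column to `toy` itself; `toy2` is just another name for it.
toy2 <- toy[, doubled := v * 2]
added_in_place <- "doubled" %in% names(toy)

# copy() is needed to get an independent table.
independent <- copy(toy)
independent[, extra := 0]
original_untouched <- !("extra" %in% names(toy))
```

All three logical flags should come out `TRUE`, which is why the separate `dt2` assignment in the question adds nothing: `dt1` already carries the new column.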

You can also pass everything directly to data.table without writing a function:

dt1[,outlier := fifelse(stl(ts(hr_visitors,frequency = 24),"periodic",robust = TRUE)$weights < 1e-8,"Yes","No"),by = location_id]
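
To get a feel for how the grouped update behaves with more than one location, here is a rough sketch on synthetic data. The table `sim`, the number of locations, and the Poisson counts are assumptions made for illustration and are not taken from the question. Timings depend on the machine, so none are shown.

```r
library(data.table)

# Same helper as in the answer: one "Yes"/"No" flag per input row.
stl_model_outlier <- function(x) {
  x.ts <- ts(x, frequency = 24)
  weights <- stl(x.ts, "periodic", robust = TRUE)$weights
  fifelse(weights < 1e-8, "Yes", "No")
}

# Hypothetical synthetic data: 20 locations, 14 days of hourly counts each.
set.seed(42)
n_loc <- 20L
n_obs <- 24L * 14L
sim <- data.table(
  location_id = rep(sprintf("loc%02d", seq_len(n_loc)), each = n_obs),
  hr_visitors = as.numeric(rpois(n_loc * n_obs, lambda = 50))
)

# Time the grouped update by reference.
elapsed <- system.time(
  sim[, outlier := stl_model_outlier(hr_visitors), by = location_id]
)[["elapsed"]]

# Rows flagged as outliers, if any.
flagged <- sim[outlier == "Yes"]
```

Note that stl needs at least two full periods per group, so with `frequency = 24` every location must have more than 48 hourly observations. Most of the runtime is likely to come from stl itself, so it may be worth timing a few groups first before running all 150+ million rows.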