Speeding up bootstrapping with parallel processing in R

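I would like to speed up my bootstrap function, which works perfectly fine as it is. I read that since R 2.14 there is a package called parallel, but I find it very hard to actually use for someone with little knowledge of computer science. Maybe somebody can help.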

So here we have a bootstrap:

n<-1000
boot<-1000
x<-rnorm(n,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
  print(paste('Run',i,sep=" "))
}

The goal is to use parallel processing and make use of the multiple cores of my PC. I am running R under Windows. Thanks!

EDIT (after Noah's reply)

The following code can be used for testing:

library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot,.combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
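# elapsed time of the parallel and the sequential run, and their ratio
end1-start1
end2-start2
as.numeric(difftime(end1,start1,units="secs"))/as.numeric(difftime(end2,start2,units="secs"))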

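However, on my machine the simple R code is faster. Is this one of the known side effects of parallel processing, i.e. does it introduce extra overhead that adds to the run time of "simple tasks" like this one?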

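Edit: On my machine the parallel code takes about 5 times longer than the "simple" code. This factor apparently does not change as I increase the complexity of the task (e.g. by increasing boot or n). So maybe there is a problem with the code or with my machine (Windows-based processing?).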

Solution

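Try the boot package. It is well optimized and has a parallel argument. The tricky thing with this package is that you have to write a new function to calculate your statistic, one that accepts the data you are working on and a vector of indices used to resample the data. So, starting from where you define data, you could do something like this: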

library(boot)

# Define a function to resample the data set from a vector of indices
# and return the slope
slopeFun <- function(df, i) {
  # df must be a data frame.
  # i is the vector of row indices that boot will pass
  xResamp <- df[i, ]
  coef(lm(y ~ x, data = xResamp))[2]
}

# Then carry out the resampling
# (ncpus defaults to 1, which runs serially, so set it above 1; 2 is an arbitrary choice)
b <- boot(data, slopeFun, R = 1000, parallel = "multicore", ncpus = 2)

b$t holds the resampled statistic (as a one-column matrix), and boot has lots of nice methods to easily do things with it, for instance plot(b).
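As a small illustration of those methods, the sketch below summarises the bootstrap distribution and asks for a percentile confidence interval. The data are simulated as in the question; the seed, R = 500 and the choice of the percentile interval are assumptions made only for this example, and the run is serial to keep it short.

```r
library(boot)

set.seed(42)  # assumption: fixed seed, only for reproducibility
n <- 1000
x <- rnorm(n, 1)
y <- rnorm(n, 1 + 2 * x, 2)
data <- data.frame(x, y)

# Statistic: slope of y ~ x on the rows selected by the index vector i
slopeFun <- function(df, i) coef(lm(y ~ x, data = df[i, ]))[2]

b <- boot(data, slopeFun, R = 500)  # serial run, enough for illustration

b$t0                             # slope estimated on the original data
sd(b$t)                          # bootstrap standard error of the slope
ci <- boot.ci(b, type = "perc")  # percentile confidence interval
ci
```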

Note that the parallel methods depend on your platform. On your Windows machine you will need to use parallel = "snow".
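On Windows, the call might look like the following sketch. The data and the statistic function are repeated so that the block runs on its own; the seed and ncpus = 2 are assumptions for illustration (something like parallel::detectCores() could be used to pick the number of workers instead).

```r
library(boot)

set.seed(1)  # assumption: fixed seed, only so the simulated data are reproducible
n <- 1000
x <- rnorm(n, 1)
y <- rnorm(n, 1 + 2 * x, 2)
data <- data.frame(x, y)

# Statistic: slope of y ~ x on the rows selected by the index vector i
slopeFun <- function(df, i) {
  coef(lm(y ~ x, data = df[i, ]))[2]
}

# parallel = "snow" uses a socket cluster of ncpus worker processes,
# which is available on Windows. With the default ncpus = 1, boot runs serially.
b <- boot(data, slopeFun, R = 1000, parallel = "snow", ncpus = 2)

mean(b$t)  # average of the resampled slopes
```

Starting the workers has a fixed cost, so for a statistic as cheap as a single lm fit the parallel run is not guaranteed to beat the serial one.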
