How to quickly and accurately compute externally studentized residuals in R
Given the following constraints, I want to compute the externally studentized residuals of a dataset {x, y} of size n in R:
- (very) high accuracy
- high performance (avoiding loops wherever possible)
- the R language (including Rcpp)
The R code should be fast because it will be used extensively on many datasets, with n in [10^3, 10^6] and at least 10^9 repetitions. This question is part of a larger effort to estimate a custom statistic that requires studentized residuals. The most computationally demanding part is the one presented here, so solving it would greatly improve overall efficiency.
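For reference, the quantity being targeted is the standard externally studentized residual (n observations, p regression parameters, hat values h_ii):

```latex
t_i \;=\; \frac{\hat e_i}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}},
\qquad
\hat\sigma_{(i)}^2 \;=\; \frac{\sum_{j}\hat e_j^{\,2}-\hat e_i^{\,2}/(1-h_{ii})}{n-p-1}
```

where sigma_(i) is the residual standard error of the fit with observation i left out; the closed form on the right is what makes n separate refits unnecessary.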
About lm() regression
To collect externally studentized residuals, one typically runs lm() and then rstudent(). The method the R function uses avoids running n regressions to estimate the studentized residuals, which saves a lot of execution time. However, I don't want to use lm(), because I only need the residuals without the fancy extra work that comes with it (thus saving even more execution time).
When trying to decipher the R source code for the external residuals in lm(), I found it somewhat obscure, since it appears to call code from other external files (for example the influence() function). So at this point I have not been able to gather enough information to replicate the code section from the source alone.
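One way to recover the relevant source without digging through files is to ask R for the method directly; rstudent() dispatches to stats:::rstudent.lm, which delegates the hat values and leave-one-out sigmas to lm.influence() (a sketch; the exact body varies slightly across R versions):

```r
# Print the code rstudent() actually runs for lm objects:
# its body divides the residuals by infl$sigma * sqrt(1 - infl$hat),
# where infl comes from lm.influence().
getS3method("rstudent", "lm")
# The workhorse behind influence() / rstudent():
stats::lm.influence
```

This shows that the analytic shortcut lives in lm.influence(), which is the piece to replicate.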
Related topics on Stack
The following related topic has been found on Stack: How to compute Studentized Residuals in Python?
It gives an R implementation of a Python procedure, including a minimal example (corrected by @StéphaneLaurent; see the answers):
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
m = 2
mean_y = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_y) %*% (y - mean_y)
beta_1 = ((y - mean_y) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_y
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_y) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- lm.fit(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
reg = rstudent(lm(x ~ y))
res = cbind(reg,studentized_residuals)
This produces the following differences:
index reg studentized_residuals
1 -0.595911898846465 -0.581348373714385
2 0.116208945967327 0.116097011762269
3 -2.04779452591111 -1.61939642040734
4 2.26350621688535 1.71995630000724
5 0.603322309518977 0.588222428131761
6 -1.5460639774285 -1.33486217871738
7 0.367900050364855 0.364393996552621
8 1.14745971090533 1.05271762293388
9 0.823888320713653 0.786630743176311
10 -0.449839343257121 -0.443475039943641
Minimal example
The following R attempt has been tested with arbitrary datasets, for illustration purposes only. It uses lm() / rstudent() and is far too slow for our real application. The two parameters n1 and n2 correspond respectively to the number of iterations and the size of the vectors (denoted n above). To match our problem, we would typically pick n1 in [10^6, 10^9] and n2 in [10^3, 10^6]:
Stud = function(n1, n2) {
  res = data.frame(matrix(vector(), n2, n1))
  for (i in 1:n1) {
    x = rnorm(n2)
    y = rnorm(n2)
    reg = lm(x ~ y)
    res[, i] = rstudent(reg)
  }
}
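Since every dataset has the same size, the loop over lm()/rstudent() can in principle be removed entirely. Below is a hedged, loop-free sketch (studAll is a hypothetical helper, not from the post): all n1 datasets sit as columns of n2 x n1 matrices, and the externally studentized residuals are obtained column-wise from the closed-form leave-one-out identity, so lm() and rstudent() are never called.

```r
# studAll: externally studentized residuals for many simple regressions
# x ~ y at once. X, Y are n2 x n1 matrices, one dataset per column.
studAll <- function(X, Y) {
  n2 <- nrow(X)
  Yc <- sweep(Y, 2, colMeans(Y))                          # centered predictors
  b1 <- colSums(Yc * X) / colSums(Yc^2)                   # one slope per column
  E  <- sweep(X, 2, colMeans(X)) - sweep(Yc, 2, b1, `*`)  # residuals
  H  <- sweep(Yc^2, 2, colSums(Yc^2), `/`) + 1/n2         # hat diagonals
  df <- n2 - 2
  # leave-one-out variances: (RSS - e_i^2/(1-h_i)) / (df - 1), column-wise
  S2 <- sweep(-E^2 / (1 - H), 2, colSums(E^2), `+`) / (df - 1)
  E / sqrt(S2 * (1 - H))
}

set.seed(1)
Y <- matrix(rnorm(50), 10, 5)
X <- matrix(rnorm(50), 10, 5)
res <- studAll(X, Y)
# each column matches rstudent(lm(X[, j] ~ Y[, j]))
```

This trades the n1 regression fits for a handful of matrix operations; whether it is numerically accurate enough for the "(very) high precision" requirement would need to be checked against rstudent() on real data.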
Update with a benchmark and another (complete) minimal example:
Here we show a complete benchmark in which various functions from Stack are tested against lm(), with the goal of collecting externally studentized residuals. Collecting these residuals requires running n regressions. Results are given after the code, for 100 and 500 replications.
#Packages
install.packages("Rcpp")
library(Rcpp)
install.packages("RcppArmadillo")
library(RcppArmadillo)
install.packages("RcppEigen")
library(RcppEigen)
install.packages("stats")
library(stats)
install.packages("speedglm")
library(speedglm)
install.packages("Rfast")
library(Rfast)
install.packages("rbenchmark")
library(rbenchmark)
## start from SEXP,most conversions,longest code
src <- '
Rcpp::List fLmSEXP(SEXP Xs,SEXP ys) {
Rcpp::NumericMatrix Xr(Xs);
Rcpp::NumericVector yr(ys);
int n = Xr.nrow(),k = Xr.ncol();
arma::mat X(Xr.begin(),n,k,false);
arma::colvec y(yr.begin(),yr.size(),false);
int df = n - k;
// fit model y ~ X,extract residuals
arma::colvec coef = arma::solve(X,y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(),res.end(),res.begin(),0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,Rcpp::Named("stderr") =sderr,Rcpp::Named("df") =df,Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src,depends="RcppArmadillo")
## start from Rcpp types as early RcppArmadillo examples did
src <- '
Rcpp::List fLmTwoCasts(Rcpp::NumericMatrix Xr,Rcpp::NumericVector yr) {
Rcpp::List fLmTwoCasts(Rcpp::NumericMatrix Xr,Rcpp::NumericVector yr) {
int n = Xr.nrow(),k = Xr.ncol();
arma::mat X(Xr.begin(),n,k,false);
arma::colvec y(yr.begin(),yr.size(),false);
int df = n - k;
// fit model y ~ X,extract residuals
arma::colvec coef = arma::solve(X,y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(),res.end(),res.begin(),0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,Rcpp::Named("stderr") =sderr,Rcpp::Named("df") =df,Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src,depends="RcppArmadillo")
## start from Armadillo types
src <- '
Rcpp::List fLmOneCast(arma::mat X,arma::colvec y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X,extract residuals
arma::colvec coef = arma::solve(X,y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(),res.end(),res.begin(),0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,Rcpp::Named("stderr") =sderr,Rcpp::Named("df") =df,Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src,depends="RcppArmadillo")
## start from Armadillo types passed as constant references
src <- '
Rcpp::List fLmConstRef(const arma::mat & X,const arma::colvec & y) {
int df = X.n_rows - X.n_cols;
// fit model y ~ X,extract residuals
arma::colvec coef = arma::solve(X,y);
arma::colvec res = y - X*coef;
double s2 = std::inner_product(res.begin(),res.end(),res.begin(),0.0)/df;
// std.errors of coefficients
arma::colvec sderr = arma::sqrt(s2 *
arma::diagvec(arma::pinv(arma::trans(X)*X)));
return Rcpp::List::create(Rcpp::Named("coefficients")=coef,Rcpp::Named("stderr") =sderr,Rcpp::Named("df") =df,Rcpp::Named("residuals") =res);
}
'
cppFunction(code=src,depends="RcppArmadillo")
#Benchmark
data = benchmark("OneCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- fLmOneCast(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"TwoCast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- fLmTwoCasts(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Const" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- fLmConstRef(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Sexp" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- fLmSEXP(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Fast" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- fastLm(x[-i] ~ y[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Speed" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- speedlm(x[-i] ~ y[-i],fitted = T)
sum((x[-i] - fit$fitted.values)^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},".Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- .lm.fit(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Fit" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- lmfit(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Lm" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
m=2
mean_data = mean(y)
mean_x = mean(x)
diff_mean_sqr = (y - mean_data) %*% (y - mean_data)
beta_1 = ((y - mean_data) %*% (x - mean_x)) / diff_mean_sqr
beta_0 = mean_x - c(beta_1) * mean_data
x_hat = beta_0 + c(beta_1) * y
residuals = x - x_hat
h_ii = ((y - mean_data) ^ 2) / c(diff_mean_sqr) + (1 / n)
var_e = sqrt(vapply(1:n,function(i){
fit <- lm(x[-i] ~ y[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
SE_regression = var_e * (sqrt(1 - h_ii))
studentized_residuals = residuals / SE_regression
},"Basic" = {
n = 15
set.seed(1)
y = rnorm(n)
x <- rnorm(n)
reg <- lm(x ~ y)
reg_stud <- rstudent(reg)
},replications = 500,columns = c("test","elapsed","replications"))
Results:
On this single benchmark, rstudent(lm()) is faster than everything else:
test elapsed replications
7 .Fit 13.84 100
10 Basic 0.25 100
3 Const 7.37 100
5 Fast 99.84 100
8 Fit 7.06 100
9 Lm 105.25 100
1 OneCast 7.61 100
4 Sexp 7.66 100
6 Speed 184.76 100
2 TwoCast 7.17 100
7 .Fit 63.63 500
10 Basic 0.93 500
3 Const 34.44 500
5 Fast 438.95 500
8 Fit 31.11 500
9 Lm 471.37 500
1 OneCast 34.29 500
4 Sexp 33.48 500
6 Speed 794.73 500
2 TwoCast 33.51 500
Interpretation
It seems that R uses an analytical alternative that avoids the n regressions, making the computation much faster. So the question remains: how can we be competitive with rstudent(lm()), and how can we reverse-engineer the original source code (which is hard to gather)?
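The analytical alternative can be reproduced directly. Below is a minimal sketch for the simple regression x ~ y (assuming no missing values or weights): one fast fit gives the residuals, the hat values come from the textbook simple-regression formula, and the leave-one-out sigmas follow from a closed-form identity, so the n extra fits disappear.

```r
# Closed-form externally studentized residuals for x ~ y, a sketch of
# the analytic shortcut that rstudent() exploits internally.
set.seed(1)
n <- 10
x <- rnorm(n)
y <- rnorm(n)
fit <- .lm.fit(cbind(1, y), x)                        # one fast fit
e   <- fit$residuals
h   <- 1/n + (y - mean(y))^2 / sum((y - mean(y))^2)   # hat diagonal
df  <- n - 2
# leave-one-out variances: (RSS - e_i^2/(1-h_i)) / (df - 1)
s2_i <- (sum(e^2) - e^2 / (1 - h)) / (df - 1)
t_i  <- e / sqrt(s2_i * (1 - h))
cbind(t_i, rstudent(lm(x ~ y)))                       # identical columns
```

The same identity generalizes to any design matrix once h comes from the QR decomposition, which is exactly what lm.influence() computes.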
Workaround
You get the same result by replacing your var_e with

var_e = vapply(1:n,function(i){
sigma(lm(x[-i] ~ y[-i]))
},numeric(1))

To obtain it efficiently, don't use lm but lm.fit:
var_e = sqrt(vapply(1:n,function(i){
fit <- lm.fit(cbind(1,y[-i]),x[-i])
sum(fit$residuals^2)
},numeric(1)) / (n-m-1))
Edit:
Edited to show that a faster function than the one previously given has been found:

fast_rstudent <- function(X,y,intercept = TRUE){
  mqr <- .Call(stats:::C_Cdqrls,cbind(intercept,X),y,1e-7,FALSE)
  res <- .Call(stats:::C_influence,mqr,mqr$residuals,1e-12)
  mqr$residuals/(res$sigma*sqrt(1-res$hat))
}

So far, this function is very fast.

Previous answer
Since you are using R, you can solve this problem using the qr decomposition. Your goal is to write a rstudent function that is faster than the built-in one by getting rid of overhead function calls, i.e. by using only the necessary internal functions. Below is a quick way to do this; to some extent the function removes almost all overhead, thereby abusing R, and it assumes that whatever is supplied to it is correct:
my_rstudent <- function (X,y,intercept = TRUE) {
X <- cbind(intercept,X)
u <- .Call(stats:::C_Cdqrls,X,y,1e-7,FALSE)
d <- dim(X)
n <- as.integer(d[1L])
k <- as.integer(d[2L])
df_res <- n - k
z <- .Internal(diag(1,n,k))
v <- .Fortran(.F_dqrqy,as.double(u$qr),n,k,as.double(u$qraux),z,k,qy = z)$qy
h_ii <- .Internal(rowSums(v^2,n,k,FALSE))
rstand <- u$residuals/sqrt(sum(u$residuals**2)/df_res)/sqrt(1-h_ii)
rstand * sqrt((df_res - 1)/(df_res - rstand^2))
}
Benchmark:
n = 10
set.seed(1)
x = rnorm(n)
y = rnorm(n)
cbind(mine=my_rstudent(x,y),from_R=rstudent(lm(y~x)))
mine from_R
1 0.92113157 0.92113157
2 0.15753536 0.15753536
3 -1.69587949 -1.69587949
4 -3.59182456 -3.59182456
5 0.98274664 0.98274664
6 -0.85765961 -0.85765961
7 -0.07768369 -0.07768369
8 1.05874766 1.05874766
9 0.80181623 0.80181623
10 0.11418833 0.11418833
For small datasets, the overhead functions slow down the computation of rstudent.
Relatively large dataset:
microbenchmark::microbenchmark(my_rstudent(x,y),rstudent(lm(y~x)),unit="relative",times = 100)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x,y) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 100
rstudent(lm(y ~ x)) 45.36667 37.20755 26.89753 24.29545 22.39587 11.31733 100
Huge dataset
n = 1000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x,y),rstudent(lm(y~x)),unit="relative",times = 100)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x,y) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
rstudent(lm(y ~ x)) 8.530228 8.059269 7.700426 7.848123 7.616909 3.877305 100
Very huge dataset
n = 1000000
set.seed(1)
x = rnorm(n)
y = rnorm(n)
microbenchmark::microbenchmark(my_rstudent(x,y),rstudent(lm(y~x)),unit="relative",times = 10)
Unit: relative
expr min lq mean median uq max neval
my_rstudent(x,y) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10
rstudent(lm(y ~ x)) 1.510198 1.560989 1.486083 1.666609 1.603455 1.01154 10
I think the way to solve your problem is to first reduce all unnecessary function overhead; if that is not fast enough, convert the code to C++ and run it with Rcpp. There is a good chance my results could be improved by computing the residuals of lm.fit with your own implementation rather than using lm.fit itself, as I did.
I also checked whether the studentized residuals differ depending on which function you use (lm, lm.fit, .lm.fit), and it turns out that this is the case. However, the residuals of my function here are equal to those produced by MASS::studres for a regression of y ~ x, where x has only one column.
Here are my code and a benchmark against the fastest version from above, called "Basic":
library(rbenchmark)
library(microbenchmark)
library(MASS)
set.seed(1)
x <- matrix(rnorm(500),ncol = 1)
y <- matrix(rnorm(500),ncol = 1)
myFunc <- function(x,y,n = 500){
# tmp <- .lm.fit(x,y) # linear model fit
object <- lm.fit(x = x,y = y)
resid <- object$residuals
hat <- lm.influence(object,do.coef = FALSE)$hat
# hat <- hat[hat > 0] # remove checks
# ok <- !(is.na(resid)) # remove checks
# n.miss <- sum(!ok) # remove checks
# resid <- resid[ok] # remove checks
# n <- length(resid)
# p <- object$rank # equal to one
p <- 1
rdf <- n - 1
studres <- resid
stddev <- sqrt(sum(resid^2)/rdf)
sr <- resid/(sqrt(1 - hat) * stddev)
stdres <- sr
studres <- sr/sqrt((n - p - sr^2)/(n - p - 1))
studres <- naresid(object$na.action,studres)
return(studres)
}
test1 <- stats::rstudent(lm(x ~ y)) # rstudent doesn't work with lm.fit
test2 <- MASS::studres(lm(x ~ y))
test3 <- MASS::studres(lm.fit(x,y))
test4 <- myFunc(x,y,n = 500)
> head(cbind(test1,test2,test3,test4))
test1 test2 test3 test4
1 -0.6368094 -0.6368094 0.04696790 0.04696790
2 0.1493050 0.1493050 -0.27286396 -0.27286396
3 -0.8941217 -0.8941217 -1.15505676 -1.15505676
4 1.5598965 1.5598965 0.07729179 0.07729179
5 0.3440252 0.3440252 0.95155123 0.95155123
6 -0.7714317 -0.7714317 1.47600416 1.47600416
####################################
mbm <- microbenchmark("lm" = {rstudent(lm(y~x)) },"MASS_lm" = {
MASS::studres(lm(y~x))
},"MASS_lm.fit" = {
MASS::studres(lm.fit(x = x,y = y))
},"myFunc" = {myFunc(x,y,n = 500)},times = 100
)
> mbm
Unit: microseconds
expr min lq mean median uq max neval
lm 767.001 869.1510 1188.023 977.1505 1185.5010 8279.801 100
MASS_lm 704.601 909.2000 1085.261 997.3515 1168.8505 2052.202 100
MASS_lm.fit 168.001 195.0510 282.166 212.9510 254.1015 2912.201 100
myFunc 147.901 168.8015 234.261 190.0010 249.7515 1193.701 100
Note that you must specify n according to the length of the vector x or y.
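As a small hedged tweak (myFunc2 is a hypothetical name, not from the answer), n and p can instead be derived from the fit itself, so nothing has to be hard-coded and a design matrix with an explicit intercept column also works:

```r
# myFunc2: like myFunc above, but n and p come from the lm.fit result,
# so there is no fixed n = 500 and any full-rank design matrix works.
myFunc2 <- function(x, y) {
  object <- lm.fit(x = x, y = y)
  resid  <- object$residuals
  hat    <- lm.influence(object, do.coef = FALSE)$hat
  n <- length(resid)
  p <- object$rank
  stddev <- sqrt(sum(resid^2) / (n - p))
  sr <- resid / (sqrt(1 - hat) * stddev)   # internally studentized
  sr / sqrt((n - p - sr^2) / (n - p - 1))  # externally studentized
}
```

With an explicit intercept column, myFunc2(cbind(1, x), y) agrees with rstudent(lm(y ~ x)).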