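How to compute the test and training error of ridge regression as a function of sample size
I am using the Hitters dataset in R. So far I have fit a linear regression that predicts Salary from all the other covariates, with the sample size ranging from 20 to 75, and computed the mean test/training error: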
data("Hitters",package = 'ISLR')
Hitters = na.omit(Hitters)
set.seed(1)
train.idx = sample(1:nrow(Hitters),75,replace=FALSE)
train = Hitters[train.idx,-20]
test = Hitters[-train.idx,-20]
errs <- rep(NA,56)
for (ii in 20:75){
train.idx = sample(1:nrow(Hitters),ii,replace=FALSE)
train = Hitters[train.idx,-20]
test = Hitters[-train.idx,-20]
train.lm <- lm(Salary ~.,- Salary,data = train)
train.pred <- predict(train.lm,train)
test.pred <- predict(train.lm,data = test)
errs[ii-19] <- mean((test.pred - train$Salary)^2)
}
errs
Now I am trying to do the same for ridge regression with a regularization parameter of 20, using the samples I created earlier. I tried:
x_train = model.matrix(Salary~.,train)[,-1]
x_test = model.matrix(Salary~.,test)[,-1]
y_train = train$Salary
y_test = test$Salary
#cv.out = cv.glmnet(x_train,y_train,alpha = 0)
#lam = cv.out$lambda.min
errs.train <- rep(NA,56)
for (ii in 20:75){
ridge_mod = glmnet(x_train,alpha=0,lambda = 20)
ridge_pred = predict(ridge_mod,newx = x_test)
#errs.test[ii] <- mean((ridge_pred - y_test)^2)
errs.train[ii-19] <- mean((ridge_pred - y_train)^2)
}
errs.train
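Solution
There are a few mistakes in the first part of your code, the lm part. The prediction call should be predict(train.lm, newdata = test) and not predict(train.lm, data = test): predict.lm has no data argument, so the test set is silently ignored and you get the fitted values for the training set instead. If you are unsure about the arguments, run ?predict.lm. Second, if you are interested in the error on the test set, you should subtract the values in test$Salary from the predictions made on test, not the values in train$Salary. Something like the following should work: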
data("Hitters",package = 'ISLR')
Hitters = na.omit(Hitters)
set.seed(1)
sample_size = 20:75
errs = vector("numeric",length(sample_size))
for (ii in seq_along(sample_size)){
train.idx = sample(1:nrow(Hitters),sample_size[ii],replace=FALSE)
train = Hitters[train.idx,-20]
test = Hitters[-train.idx,-20]
train.lm <- lm(Salary ~.,data = train)
test.pred <- predict(train.lm,newdata = test)
errs[ii] <- mean((test.pred - test$Salary)^2)
}
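The question also asks for the training error, which the loop above does not record. A minimal sketch of how both errors could be collected in the same pass, assuming the ISLR package is installed; the helper mse() and the vector names errs.train and errs.test are illustrative names, not part of the original answer:

```r
# Sketch: training and test MSE of lm() as a function of sample size.
# Assumes the ISLR package is installed. mse() is a hypothetical helper.
data("Hitters", package = "ISLR")
Hitters <- na.omit(Hitters)
dat <- Hitters[, -20]  # drop column 20, as in the code above

mse <- function(pred, obs) mean((pred - obs)^2)

set.seed(1)
sample_size <- 20:75
errs.train <- numeric(length(sample_size))
errs.test  <- numeric(length(sample_size))

for (ii in seq_along(sample_size)) {
  train.idx <- sample(nrow(dat), sample_size[ii])
  train <- dat[train.idx, ]
  test  <- dat[-train.idx, ]
  fit <- lm(Salary ~ ., data = train)
  # Error on the rows the model was fitted on
  errs.train[ii] <- mse(predict(fit, newdata = train), train$Salary)
  # Error on the held-out rows; with very small samples the fit can be
  # rank-deficient, in which case predict() may emit a warning
  errs.test[ii] <- mse(predict(fit, newdata = test), test$Salary)
}
```

With roughly as many coefficients as observations at the smallest sample sizes, the training error will be close to zero there while the test error can be very large, so the two curves are best read side by side.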
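Now for ridge, the only difference is that you create the model matrix once and subset it at each iteration:
library(glmnet)
errs.test = vector("numeric",length(sample_size))
# build the design matrix once; [,-1] drops the intercept column
x_data = model.matrix(Salary~.,Hitters)[,-1]
y_data = Hitters$Salary
for (ii in seq_along(sample_size)){
# draw sample_size[ii] rows for training; the rest form the test set
train.idx = sample(1:nrow(x_data),sample_size[ii],replace=FALSE)
x_train = x_data[train.idx,]
x_test = x_data[-train.idx,]
y_train = y_data[train.idx]
y_test = y_data[-train.idx]
# alpha = 0 selects the ridge penalty
ridge_mod = glmnet(x_train,y_train,alpha=0,lambda = 20)
ridge_pred = predict(ridge_mod,newx = x_test)
errs.test[ii] <- mean((ridge_pred - y_test)^2)
}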
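As with lm, the ridge loop can also record the training error, and the two curves can then be plotted against the sample size. A rough sketch, assuming the ISLR and glmnet packages are installed; the names ridge.train and ridge.test and the plotting choices are illustrative, not taken from the answer:

```r
# Sketch: ridge training and test MSE versus sample size, lambda fixed at 20.
# Assumes the ISLR and glmnet packages are installed.
library(glmnet)
data("Hitters", package = "ISLR")
Hitters <- na.omit(Hitters)
x_data <- model.matrix(Salary ~ ., Hitters)[, -1]  # drop the intercept column
y_data <- Hitters$Salary

set.seed(1)
sample_size <- 20:75
ridge.train <- numeric(length(sample_size))
ridge.test  <- numeric(length(sample_size))

for (ii in seq_along(sample_size)) {
  train.idx <- sample(nrow(x_data), sample_size[ii])
  fit <- glmnet(x_data[train.idx, ], y_data[train.idx], alpha = 0, lambda = 20)
  # predict() returns a one-column matrix; subtracting a vector works elementwise
  ridge.train[ii] <- mean((predict(fit, newx = x_data[train.idx, ]) - y_data[train.idx])^2)
  ridge.test[ii]  <- mean((predict(fit, newx = x_data[-train.idx, ]) - y_data[-train.idx])^2)
}

# Plot both error curves against the training sample size
matplot(sample_size, cbind(ridge.train, ridge.test), type = "l",
        lty = 1:2, col = 1:2, xlab = "Training sample size", ylab = "MSE")
legend("topright", legend = c("train", "test"), lty = 1:2, col = 1:2)
```

Because each sample size is drawn only once, the curves will be noisy; averaging over several random draws per sample size would smooth them, at the cost of a longer run.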