Two-level stacked learner (ensemble model) combining elastic net and logistic regression using mlr3


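I am trying to solve a common problem in medicine: combining a prediction model with other sources of information, for example an expert opinion (which is sometimes heavily emphasized in medicine). In this post the expert opinion is called the superdoc predictor.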

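This can be solved by stacking the model with a logistic regression (into which the expert opinion enters as a predictor), as described on page 26 of this paper: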

Afshar P, Mohammadi A, Plataniotis KN, Oikonomou A, Benali H. From handcrafted to deep-learning-based cancer radiomics: challenges and opportunities. IEEE Signal Processing Magazine 2019; 36:132-60. Available here

I tried this here without accounting for overfitting (I did not use out-of-fold predictions from the lower-level learner):

Example data

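# libraries
library(tidyverse)
library(caret)
library(glmnet)
library(mlbench)
library(pROC)   # provides roc(), used below
library(broom)  # provides tidy(), used below (assumption: a broom version that can tidy sparse matrices)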

# get example data
data(PimaIndiansDiabetes,package="mlbench")
data <- PimaIndiansDiabetes

# add the super doctors opinion to the data
set.seed(2323)
data %>% 
  rowwise() %>% 
  mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)),TRUE~ 0)) -> data

# separate the data in a training set and test set
train.data <- data[1:550,]
test.data <- data[551:768,]

Stacked model without out-of-fold predictions:

# elastic net regression (without the superdoc's opinion)
set.seed(2323)
model <- train(
  diabetes ~ .,
  data = train.data %>% select(-superdoc),
  method = "glmnet",
  trControl = trainControl(
    "repeatedcv",
    number = 10,
    repeats = 10,
    classProbs = TRUE,
    savePredictions = TRUE,
    summaryFunction = twoClassSummary
  ),
  tuneLength = 10,
  metric = "ROC" # ROC metric is in twoClassSummary
)


# extract the coefficients for the best alpha and lambda  
coef(model$finalModel,model$finalModel$lambdaOpt) -> coeffs
tidy(coeffs) %>% tibble() -> coeffs

coef.interc = coeffs %>% filter(row=="(Intercept)") %>% pull(value)
coef.pregnant = coeffs %>% filter(row=="pregnant") %>% pull(value)
coef.glucose = coeffs %>% filter(row=="glucose") %>% pull(value)
coef.pressure = coeffs %>% filter(row=="pressure") %>% pull(value)
coef.mass = coeffs %>% filter(row=="mass") %>% pull(value)
coef.pedigree = coeffs %>% filter(row=="pedigree") %>% pull(value)
coef.age = coeffs %>% filter(row=="age") %>% pull(value)


# combine the model with the superdoc's opinion in a logistic regression model
finalmodel = glm(diabetes ~ superdoc + I(coef.interc + coef.pregnant*pregnant + coef.glucose*glucose + coef.pressure*pressure + coef.mass*mass + coef.pedigree*pedigree + coef.age*age),family=binomial,data=train.data)


# make predictions on the test data
predict(finalmodel,test.data,type="response") -> predictions


# check the AUC of the model in the test data
roc(test.data$diabetes,predictions,ci=TRUE) 
#> Setting levels: control = neg,case = pos
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.default(response = test.data$diabetes,predictor = predictions,ci = TRUE)
#> 
#> Data: predictions in 145 controls (test.data$diabetes neg) < 73 cases (test.data$diabetes pos).
#> Area under the curve: 0.9345
#> 95% CI: 0.8969-0.9721 (DeLong)

Now I would like to take out-of-fold predictions into account using the mlr3 family of packages, following this very helpful post: Tuning a stacked learner

#library
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3filters)
library(mlr3tuning)
library(paradox)
library(glmnet)

# create elastic net regression
glmnet_lrn =  lrn("classif.cv_glmnet",predict_type = "prob")

# create the learner's cross-validated (out-of-fold) predictions
glmnet_cv1 = po("learner_cv",glmnet_lrn,id = "glmnet") #I could not find a setting to filter the predictors (i.e., not send the superdoc predictor here)

# summarize steps 
level0 = gunion(list(
  glmnet_cv1,po("nop",id = "only_superdoc_predictor")))  %>>% #I could not find a setting to send only the superdoc predictor to "union1"
  po("featureunion",id = "union1")


# final logistic regression
log_reg_lrn = lrn("classif.log_reg",predict_type = "prob")

# combine ensemble model
ensemble = level0 %>>% log_reg_lrn
ensemble$plot(html = FALSE)

Created on 2021-03-15 by the reprex package (v1.0.0)

My questions (I am fairly new to the mlr3 family of packages)

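  1. Is the mlr3 family of packages well suited to the ensemble model I am trying to build?
  2. If so, how could I finalize the ensemble model and make predictions on test.data?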

Solution

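I think mlr3 / mlr3pipelines is well suited for your task. It appears that what you are missing is mainly PipeOpSelect / po("select"), which lets you extract features by name or by other properties and makes use of Selector objects. Your code should probably look like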

library("mlr3")
library("mlr3pipelines")
library("mlr3learners")

# create elastic net regression
glmnet_lrn = lrn("classif.cv_glmnet",predict_type = "prob")

# create the learner's cross-validated (out-of-fold) predictions
glmnet_cv1 = po("learner_cv",glmnet_lrn,id = "glmnet")

# PipeOp that drops 'superdoc',i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select",id = "drop.superdoc",selector = selector_invert(selector_name("superdoc")))

# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select",id = "select.superdoc",selector = selector_name("superdoc"))

# superdoc along one path,the fitted model along the other
stacking_layer = gunion(list(
  select_superdoc,drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion",id = "union1")

# final logistic regression
log_reg_lrn = lrn("classif.log_reg",predict_type = "prob")

# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn

This is what it looks like:

ensemble$plot(html = FALSE)

The stacking graph.

To train and evaluate the model, we need to create Task objects:

train.task <- TaskClassif$new("train.data",train.data,target = "diabetes")
test.task <- TaskClassif$new("test.data",test.data,target = "diabetes")

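The model can now be trained, then used for prediction, and the quality of the predictions can be assessed. This works best if we convert the ensemble into a Learner: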

elearner = as_learner(ensemble)
# Train the Learner:
elearner$train(train.task)
# (The training may give a warning because the glm receives collinear features:
# the positive and the negative class probabilities)
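
The collinearity warning arises because the two probability columns produced by po("learner_cv") always sum to one. One possible way to avoid it, as a minimal sketch, is to drop one of the two columns before the logistic regression. This assumes the cross-validated features carry names ending in prob.neg and prob.pos; the names drop_neg, union2, stacking_layer2 and elearner2 are illustrative and not part of the original answer.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3learners")

# Rebuild the pieces from the answer so that this sketch is self-contained.
glmnet_cv1 = po("learner_cv",
  lrn("classif.cv_glmnet", predict_type = "prob"), id = "glmnet")
drop_superdoc = po("select", id = "drop.superdoc",
  selector = selector_invert(selector_name("superdoc")))
select_superdoc = po("select", id = "select.superdoc",
  selector = selector_name("superdoc"))

# Assumption: the negative-class probability column ends in "prob.neg".
# Keeping only the positive-class probability removes the redundancy.
drop_neg = po("select", id = "drop.neg",
  selector = selector_invert(selector_grep("prob\\.neg$")))

# superdoc along one path, the reduced glmnet predictions along the other
stacking_layer2 = gunion(list(
  select_superdoc,
  drop_superdoc %>>% glmnet_cv1 %>>% drop_neg
)) %>>% po("featureunion", id = "union2")

elearner2 = as_learner(
  stacking_layer2 %>>% lrn("classif.log_reg", predict_type = "prob"))
```

elearner2 can then be trained and used for prediction in the same way as elearner.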

Get predictions for the test set:

prediction = elearner$predict(test.task)
print(prediction)
#> <PredictionClassif> for 218 observations:
#>     row_ids truth response  prob.neg   prob.pos
#>           1   neg      neg 0.9417067 0.05829330
#>           2   neg      neg 0.9546343 0.04536566
#>           3   neg      neg 0.9152019 0.08479810
#> ---                                            
#>         216   neg      neg 0.9147406 0.08525943
#>         217   pos      neg 0.9078216 0.09217836
#>         218   neg      neg 0.9578515 0.04214854

The prediction was made on a Task, so it can be used directly to measure performance against the ground truth, e.g. using the "classif.auc" Measure:

msr("classif.auc")$score(prediction)
#> [1] 0.9308455
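
Several measures can be computed from the same prediction object. The following is a self-contained sketch under stated assumptions: the example data is rebuilt in base R, so the random superdoc values (and therefore the scores) will not match the numbers shown in this post, and the names dat, train_task, test_task, stacked, pred and scores are illustrative.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3learners")

# Rebuild the example data without tidyverse (assumption: these random draws
# differ from the ones produced by the code in the question).
data(PimaIndiansDiabetes, package = "mlbench")
dat = PimaIndiansDiabetes
set.seed(2323)
dat$superdoc = ifelse(dat$diabetes == "pos",
  sample(0:2, nrow(dat), replace = TRUE), 0)

train_task = as_task_classif(dat[1:550, ], target = "diabetes",
  positive = "pos", id = "train")
test_task = as_task_classif(dat[551:768, ], target = "diabetes",
  positive = "pos", id = "test")

# Same stacking graph as in the answer.
stacked = gunion(list(
  po("select", id = "select.superdoc", selector = selector_name("superdoc")),
  po("select", id = "drop.superdoc",
    selector = selector_invert(selector_name("superdoc"))) %>>%
    po("learner_cv", lrn("classif.cv_glmnet", predict_type = "prob"),
      id = "glmnet")
)) %>>% po("featureunion", id = "union1") %>>%
  lrn("classif.log_reg", predict_type = "prob")

elearner = as_learner(stacked)
elearner$train(train_task)   # may warn about collinear features
pred = elearner$predict(test_task)

# Score several measures at once and inspect the confusion matrix.
scores = pred$score(msrs(c("classif.auc", "classif.acc")))
print(scores)
print(pred$confusion)
```

The confusion matrix uses the default 0.5 probability threshold for the predicted class, whereas the AUC is computed from the predicted probabilities.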

Two notes here:

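  1. You have manually split the data into a training set and a test set. mlr3 lets you perform resampling automatically from a single Task object, and this can go beyond a simple train-test split. Using the data from the question with 10-fold cross-validation would look like this: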
    all.task <- TaskClassif$new("all.data",data,target = "diabetes")
    rr = resample(all.task,elearner,rsmp("cv"))  # will take some time
    rr$aggregate(msr("classif.auc"))
    #> classif.auc 
    #>   0.9366438
    
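  2. I have shown how to build the graph with the po("select") PipeOp because it is fully general: by choosing the selector value, you could send some features both to glmnet_lrn and directly to log_reg_lrn. If all you want is to "divert" features around a single operation, you can also use the affect_columns parameter with a Selector to choose the columns that operation acts on. This gives a (linear) graph that does exactly the same thing but is less flexible.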
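
The affect_columns variant mentioned in the second note can be sketched as follows. This is a minimal sketch rather than the answer's original code, which did not survive in this copy; the names glmnet_cv1_nosuperdoc, ensemble2 and e2learner are illustrative.

```r
library("mlr3")
library("mlr3pipelines")
library("mlr3learners")

# learner_cv only acts on the columns chosen by affect_columns, i.e. on
# everything except 'superdoc'; 'superdoc' is passed through unchanged.
glmnet_cv1_nosuperdoc = po("learner_cv",
  lrn("classif.cv_glmnet", predict_type = "prob"),
  id = "glmnet",
  affect_columns = selector_invert(selector_name("superdoc")))

# Linear graph: cross-validated glmnet predictions plus 'superdoc'
# go straight into the logistic regression.
ensemble2 = glmnet_cv1_nosuperdoc %>>%
  lrn("classif.log_reg", predict_type = "prob")
e2learner = as_learner(ensemble2)
```

e2learner is trained and evaluated like elearner. No gunion() or featureunion step is needed here, which is why a feature cannot be routed to both learners at once with this approach.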
