Caret classification thresholds


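I have been using gbm through the caret package in RStudio to estimate the probability that a failure occurs.

I used Youden's J to find the best classification threshold, which is 0.63. How do I use this threshold now? I assumed the best approach is to somehow incorporate the threshold into the gbm model in caret to get more accurate predictions, and then rerun the model on the training data. It currently defaults to 0.5, and I cannot find an obvious way to update the threshold.

Alternatively, is the threshold only used to separate the test data predictions into the correct classes? That seems more straightforward, but how do I reflect the change in the ROC_AUC plot, assuming the probabilities should be updated according to the new threshold?

Any help would be greatly appreciated. Thanks

Edit: The full code I am working with is below: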

  
library(datasets)
library(caret)
library(MLeval)
library(dplyr)

data(iris)
data <- as.data.frame(iris)

# create class
data$class <- ifelse(data$Species == "setosa","yes","no")

# split into train and test
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)


# Set up control function for training
ctrl <- trainControl(method = "cv",number = 5,returnResamp = 'none',summaryFunction = twoClassSummary,classProbs = T,savePredictions = T,verboseIter = F)

# Set up training grid - this is based on a hyper-parameter tune that was recently done
gbmGrid <-  expand.grid(interaction.depth = 10,n.trees = 20000,shrinkage = 0.01,n.minobsinnode = 4) 


# Build a standard classifier using a gradient boosted machine
set.seed(5627)
gbm_iris <- train(class ~ .,data = train,method = "gbm",metric = "ROC",tuneGrid = gbmGrid,verbose = FALSE,trControl = ctrl)

# Calculate best thresholds
caret::thresholder(gbm_iris,threshold = seq(.01,0.99,by = 0.01),final = TRUE,statistics = "all")

pred <- predict(gbm_iris,newdata = test,type = "prob")
roc <- evalm(data.frame(pred,test$class))

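Solution

There are several issues in your code. I will use the PimaIndiansDiabetes data set from mlbench, since it is better suited to this problem than the iris data set.

First, the code used to split the data into training and test sets: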

train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)

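is not suitable, because the two calls sample independently, so some rows that end up in the training set will also end up in the test set.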
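A minimal base R sketch of a split without overlap, assuming a single 70/30 split is what you want. The object names (train_idx, train_set, test_set) are hypothetical and only serve the illustration; the idea is to draw the training indices once and take their complement as the test set.

```r
# Toy illustration on the built-in iris data; all object names are hypothetical.
set.seed(42)
n <- nrow(iris)

# Draw 70% of the row indices once, then use them and their complement.
train_idx <- sample(n, size = round(0.7 * n))
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# No row is shared between the two sets.
shared_rows <- intersect(rownames(train_set), rownames(test_set))
length(shared_rows)
```

Because the test set is defined by negative indexing, every row lands in exactly one of the two sets.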

Also, avoid using function names as object names. It will save you a lot of trouble in the long run.

data(iris)
data <- as.data.frame(iris) #bad object name

An example:

library(caret)
library(ModelMetrics)
library(dplyr)
library(mlbench)

data(PimaIndiansDiabetes,package = "mlbench")

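To create the training and test sets, you can sample rows with base R sample or use caret::createDataPartition. createDataPartition is preferable because it tries to preserve the distribution of the response.

set.seed(123)
ind <- createDataPartition(PimaIndiansDiabetes$diabetes,p = 0.7)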


tr <- PimaIndiansDiabetes[ind$Resample1,]
ts <- PimaIndiansDiabetes[-ind$Resample1,]

This way, no row in the training set appears in the test set.
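To make "preserve the distribution of the response" concrete, here is a rough base R sketch of stratified sampling: 70% of the rows are drawn within each class separately. This is only an illustration of the idea on a made-up response vector, not caret's actual implementation, and the object names are hypothetical.

```r
set.seed(123)
# Hypothetical imbalanced two-class response (130 "neg", 70 "pos").
y <- factor(rep(c("neg", "pos"), times = c(130, 70)))

# Sample 70% of the row indices within each class separately.
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), round(0.7 * length(i)))]
}))
train_idx <- sort(unname(train_idx))
test_idx  <- setdiff(seq_along(y), train_idx)

# Class proportions are the same in both sets.
train_prop <- prop.table(table(y[train_idx]))
test_prop  <- prop.table(table(y[test_idx]))
train_prop
test_prop
```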

Let's create the model:

ctrl <- trainControl(method = "cv",number = 5,returnResamp = 'none',summaryFunction = twoClassSummary,classProbs = T,savePredictions = T,verboseIter = F)


gbmGrid <-  expand.grid(interaction.depth = 10,n.trees = 200,shrinkage = 0.01,n.minobsinnode = 4) 

set.seed(5627)
gbm_pima <- train(diabetes ~ .,data = tr,method = "gbm",#use xgboost
                  metric = "ROC",tuneGrid = gbmGrid,verbose = FALSE,trControl = ctrl)

Create a vector of probabilities to use as candidate thresholds:

probs <- seq(.1,0.9,by = 0.02)

ths <- thresholder(gbm_pima,threshold = probs,final = TRUE,statistics = "all")

head(ths)

  n.trees interaction.depth shrinkage n.minobsinnode prob_threshold Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall        F1 Prevalence Detection Rate Detection Prevalence
1     200                10      0.01              4           0.10       1.000  0.02222222      0.6562315      1.0000000 0.6562315  1.000 0.7924209  0.6510595      0.6510595            0.9922078
2     200                10      0.01              4           0.12       1.000  0.05213675      0.6633439      1.0000000 0.6633439  1.000 0.7975413  0.6510595      0.6510595            0.9817840
3     200                10      0.01              4           0.14       0.992  0.05954416      0.6633932      0.8666667 0.6633932  0.992 0.7949393  0.6510595      0.6458647            0.9739918
4     200                10      0.01              4           0.16       0.984  0.07435897      0.6654277      0.7936508 0.6654277  0.984 0.7936383  0.6510595      0.6406699            0.9636022
5     200                10      0.01              4           0.18       0.984  0.14188034      0.6821550      0.8750000 0.6821550  0.984 0.8053941  0.6510595      0.6406699            0.9401230
6     200                10      0.01              4           0.20       0.980  0.17179487      0.6886786      0.8833333 0.6886786  0.980 0.8086204  0.6510595      0.6380725            0.9271018
  Balanced Accuracy  Accuracy      Kappa          J      Dist
1         0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
2         0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
3         0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
4         0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
5         0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
6         0.5758974 0.6979836 0.18460584 0.15179487 0.8288729

Extract the threshold probability according to your preferred metric:

ths %>%
  mutate(prob = probs) %>%
  filter(J == max(J)) %>%
  pull(prob) -> thresh_prob

thresh_prob
0.74
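For reference, the J column is Youden's J, that is sensitivity + specificity - 1. The sketch below computes it by hand over a grid of cutoffs on made-up labels and probabilities (both hypothetical), which is roughly what the filter on max(J) above selects from.

```r
# Hypothetical labels (1 = positive) and predicted probabilities.
actual <- c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1)
prob   <- c(0.12, 0.23, 0.31, 0.44, 0.52, 0.58, 0.66, 0.81, 0.87, 0.93)
cutoffs <- seq(0.1, 0.9, by = 0.1)

# Youden's J = sensitivity + specificity - 1, evaluated at each cutoff.
youden_j <- sapply(cutoffs, function(cut) {
  predicted <- as.integer(prob > cut)
  sens <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
  spec <- sum(predicted == 0 & actual == 0) / sum(actual == 0)
  sens + spec - 1
})

# Cutoff with the highest J.
best_cutoff <- cutoffs[which.max(youden_j)]
best_cutoff
```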

Predict on the test data:

pred <- predict(gbm_pima,newdata = ts,type = "prob")
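The threshold is applied to these predicted probabilities after the fact; the model itself is not refit with it. A small sketch with made-up probabilities (pos_prob is hypothetical, standing in for pred$pos) shows how the same probabilities map to different class labels under the default 0.5 and under a tuned cutoff:

```r
# Hypothetical predicted probabilities of the "pos" class.
pos_prob <- c(0.15, 0.55, 0.62, 0.70, 0.91)
thr <- 0.74  # tuned cutoff, used here for illustration only

# Same probabilities, two different cutoffs.
pred_default <- factor(ifelse(pos_prob > 0.5, "pos", "neg"), levels = c("neg", "pos"))
pred_tuned   <- factor(ifelse(pos_prob > thr, "pos", "neg"), levels = c("neg", "pos"))

table(default = pred_default, tuned = pred_tuned)
```

The probabilities are untouched, so a ROC curve built from them is the same for any cutoff; the cutoff only selects one point on that curve.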

Create a numeric response (0 or 1) from the response in the test set, since the functions in the ModelMetrics package require it:

real <- as.numeric(factor(ts$diabetes))-1

ModelMetrics::sensitivity(real,pred$pos,cutoff = thresh_prob)
0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::specificity(real,pred$pos,cutoff = thresh_prob)
0.956

ModelMetrics::kappa(real,pred$pos,cutoff = thresh_prob)
0.2144026  #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::mcc(real,pred$pos,cutoff = thresh_prob)
0.2776309  #based on this it is clear the threshold chosen is not optimal on this test data

ModelMetrics::auc(real,pred$pos)
0.8047463  #decent AUC and low mcc and kappa indicate a poor choice of threshold

AUC is a measure over all thresholds, so it does not require a cutoff to be specified.
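One way to see this is the rank (Mann-Whitney) formulation of AUC: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below uses made-up labels and probabilities and never mentions a cutoff; it assumes there are no tied scores.

```r
# Hypothetical labels (1 = positive) and predicted probabilities, no ties.
actual <- c(0, 0, 1, 0, 1, 1)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)

# AUC from ranks: sum of the positives' ranks, shifted and scaled
# by the number of positive-negative pairs.
r <- rank(prob)
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc_value <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_value
```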

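Since only one train/test split was used, the performance estimate will be biased. It is better to use nested resampling, so that the same procedure is evaluated over several train/test splits. Here is a way to perform nested resampling.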
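As a rough illustration of the idea only (not the approach linked above), here is a self-contained base R sketch. It assumes simulated data and a plain glm in place of gbm so that it runs without extra packages, and every name in it is hypothetical. The threshold is tuned on inner cross-validated predictions and then scored on an outer fold that played no part in tuning.

```r
set.seed(1)
# Simulated two-class data; a logistic regression stands in for gbm.
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

# Youden's J at a given cutoff.
youden <- function(actual, prob, cut) {
  pred <- as.integer(prob > cut)
  mean(pred[actual == 1] == 1) + mean(pred[actual == 0] == 0) - 1
}

cutoffs <- seq(0.1, 0.9, by = 0.05)
outer_fold <- sample(rep(1:5, length.out = n))

outer_j <- sapply(1:5, function(k) {
  train <- dat[outer_fold != k, ]
  test  <- dat[outer_fold == k, ]

  # Inner loop: out-of-fold probabilities on the outer training rows.
  inner_fold <- sample(rep(1:5, length.out = nrow(train)))
  inner_prob <- numeric(nrow(train))
  for (j in 1:5) {
    fit <- glm(y ~ x, data = train[inner_fold != j, ], family = binomial)
    inner_prob[inner_fold == j] <-
      predict(fit, train[inner_fold == j, ], type = "response")
  }
  inner_j  <- sapply(cutoffs, function(cut) youden(train$y, inner_prob, cut))
  best_cut <- cutoffs[which.max(inner_j)]

  # Outer loop: refit on all outer training rows, score the untouched fold.
  final_fit <- glm(y ~ x, data = train, family = binomial)
  youden(test$y, predict(final_fit, test, type = "response"), best_cut)
})

# Average J over the outer folds, with the threshold tuned inside each fold.
mean(outer_j)
```

The spread of outer_j across folds gives a feel for how much the result depends on a particular split.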

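Edit: Answer to the question in the comments.

To create the ROC curve, you do not need to compute sensitivity and specificity for all thresholds yourself; you can use a package dedicated to this kind of task. The results will most probably be more trustworthy.
I prefer to use the pROC package: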

library(pROC)

roc.obj <- roc(real,pred$pos)
plot(roc.obj,print.thres = "best")

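The best threshold shown in the plot is the one that gives the highest specificity + sensitivity on the test data. This threshold (0.289) is clearly much lower than the threshold obtained from the cross-validated predictions (0.74). This is why I said that if you tune the threshold on cross-validated predictions and take the performance obtained that way as an indicator of how well the threshold works, there will be considerable optimistic bias.

In the example above, not tuning the threshold at all would have given better performance on the test set. This may hold in general for the Pima Indians data set, or it may be a case of an unlucky train/test split. So it is best to validate this kind of thing with nested resampling.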