How do I plot misclassified samples with SHAP?

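I have a dataset of genes, each scored between 0 and 1 for its likelihood of causing disease (a gene with a score of 1 is known to cause disease, and a gene with a score of 0.74 is likely to cause disease). I am trying to build a machine learning model to predict the disease score of new genes in regression classification.

I want to look at a SHAP decision plot for genes that are known disease genes but are scored low by the model (for example, genes scored 1 that my model scores below 0.8). I am struggling to group these genes together for plotting.

My data looks like this: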

X:
Index   Feature1  Feature2   ... FeatureN
Gene1     1           0.2          10
Gene2     1           0.1          7
Gene3     0           0.3          10
#index is actually the index and not a column

Y:
score
1
0.6
0.4

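I run an xgboost regressor with nested cross-validation, look at the MSE and the predicted r2, and plot observed versus expected values. In the observed-versus-expected plot I can see that genes with a Y score of 1 get many low predicted scores from the model, and I want to use shap to understand why the model does this. I cannot give example data.

I am trying to adapt the example code given for label classification: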

import shap

xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train,Y_train)

select = range(8) #I have 8 features after feature selection with BorutaShap
features = X.iloc[select]
features_display = X.loc[features.index]

explainer = shap.TreeExplainer(xgbr)
expected_value = explainer.expected_value

#Example code from https://slundberg.github.io/shap/notebooks/plots/decision_plot.html: 

y_pred = xgbr.predict(X) 
y_pred = (shap_values.sum(1) + expected_value) > 0
misclassified = y_pred != y_test[select]
shap.decision_plot(expected_value,shap_values,features_display,link='logit',highlight=misclassified)

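How do I select y_pred so that I get the predictions/genes that should have been 1 but were actually scored below 0.8 (or any low number)?

Edit: in response to the answer given, I tried: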

explainer = shap.TreeExplainer(xgbr)
shap_values = explainer.shap_values(X_test)

y_pred = xgbr.predict(X_test)
m = (y_pred <= 0.5) & (Y_test == 1)

shap.initjs()
shap.decision_plot(explainer.expected_value,X_test[m],return_objects=True)

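This runs, but m has a length of 171 (the number of rows in the whole of Y_test), and the plot then appears to draw all 171. From the data I know there should only be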

Solution

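First, you mention predicting the disease score of new genes in regression classification; what do you mean by that? The output appears to be binary, either 0 or 1, so this is a binary classification problem, and you should use xgboost's classifier instead. Update: per the comments, let's nevertheless assume a regression problem to mimic your case. Note, though, that for the example below we would have to set 'objective':'multi:softmax' (together with 'num_class') to output actual labels.

From your question, it seems that what you are trying to do is index the test set on the samples that were not predicted correctly and analyze the misleading features, which makes sense.

Let's reproduce your problem with a sample dataset: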

import shap
import xgboost
from sklearn.model_selection import train_test_split

# Iris features as a DataFrame and integer class labels (0, 1, 2)
X, y = shap.datasets.iris()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# No objective is set, so xgboost treats the class labels as a numeric regression target
model = xgboost.train(
    params={"learning_rate": 0.01},
    dtrain=xgboost.DMatrix(X_train, label=y_train),
    num_boost_round=100,
)

Producing SHAP plots for the whole test set is straightforward. Taking force_plot as an example:

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

shap.initjs()
shap.force_plot(explainer.expected_value,shap_values,X_test)

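Now, if we want to do the same for the misclassified samples only, we need to look at the model's output values. Since the iris dataset has several classes, let's say we want to visualize the force_plot for those samples that should have been classified as 2 but for which the output value is below 1.7: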

y_pred = model.predict(xgboost.DMatrix(X_test))
m = (y_pred <= 1.7) & (y_test == 2)
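As a side note on what such a mask contains, here is a minimal sketch with made-up numbers (the arrays `y_true` and `y_pred` below are hypothetical and not taken from the iris run above). A boolean mask always has one entry per sample, so its length equals the size of the test set; the number of selected samples is the count of True entries, not the length.

```python
import numpy as np

# Hypothetical true scores and model predictions for six samples
y_true = np.array([1.0, 1.0, 0.6, 1.0, 0.4, 1.0])
y_pred = np.array([0.95, 0.45, 0.55, 0.30, 0.35, 0.85])

# True where a known positive (score 1) was predicted low (<= 0.5)
mask = (y_pred <= 0.5) & (y_true == 1)

n_samples = len(mask)                # one entry per sample, not per hit
n_selected = int(mask.sum())         # number of True entries
selected_idx = np.flatnonzero(mask)  # integer positions of the hits

print(n_samples, n_selected, selected_idx.tolist())
```

One thing that may be worth checking, as an assumption about the setup rather than something stated in the question: if the target is a one-column pandas DataFrame instead of a Series or a NumPy array, the comparison produces a boolean DataFrame, and indexing another DataFrame with it masks element-wise and keeps every row. Converting the target to a 1-D NumPy array before building the mask avoids that.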

Now let's use the mask to perform boolean indexing on X_test and recompute shap_values for the selected rows:

shap.initjs()
# SHAP values for the masked rows only, so they line up with X_test[m]
shap_values = explainer.shap_values(X_test[m])
shap.force_plot(explainer.expected_value, shap_values, X_test[m])

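This tells us that petal length and petal width are pushing the regression toward higher values. So they are presumably the variables that play the major role in the misclassification.

Similarly, for decision_plot: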

shap.decision_plot(explainer.expected_value, explainer.shap_values(X_test[m]), X_test[m], feature_order='hclust', return_objects=True)
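To illustrate why the SHAP values and the feature rows passed to a plot have to come from the same subset, here is a small sketch that needs neither shap nor xgboost. It assumes a hypothetical linear model with independent features, for which the SHAP value of a feature is its weight times the feature's deviation from its mean; all names and numbers are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear model: f(x) = intercept + x @ weights
X = rng.normal(size=(50, 3))
weights = np.array([0.5, -1.0, 2.0])
intercept = 0.3
y_pred = intercept + X @ weights

# For a linear model with independent features, the SHAP value of
# feature j in row i is weights[j] * (X[i, j] - mean of column j)
expected_value = float(np.mean(y_pred))
shap_values = weights * (X - X.mean(axis=0))

# Additivity: base value plus the row sum recovers each prediction
reconstructed = expected_value + shap_values.sum(axis=1)

# Select the lowest predictions and index both arrays with the same mask
mask = y_pred <= np.quantile(y_pred, 0.2)
shap_subset = shap_values[mask]
X_subset = X[mask]

print(np.allclose(reconstructed, y_pred), shap_subset.shape, X_subset.shape)
```

The two subsets have identical shapes, which is what shap.decision_plot and shap.force_plot expect for their SHAP-value and feature arguments. With a tree model the values would come from explainer.shap_values instead of the closed form used here.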

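Since I don't have your dataset I can't check the code, but here are some ideas that may point you in the right direction.

It seems you did not train the regressor. It should look like this: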

xgbr = xgboost.XGBRegressor()
xgbr.fit(X, Y)  # the scikit-learn style estimator is trained with fit()

Now you can use xgbr.predict(X) ;)

You also need to set up the explainer and compute the SHAP values:

import warnings

explainer = shap.TreeExplainer(xgbr)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence warnings raised while computing SHAP values
    sh = explainer.shap_values(X)

Now you can select the values:

y_pred = xgbr.predict(X)
misclassified = (y_pred <= 0.7) & (Y == 1)
# No link='logit' here: the regressor's output is already on the score scale
shap.decision_plot(explainer.expected_value, sh, X, highlight=misclassified)

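Before using shap, I suggest checking how well the regressor fits the data. To do that, set aside part of the data as a test set and do not use it in training. You can then assess the goodness of fit by computing and comparing the MSE on the test set and on the training set. The larger the gap between the two, the worse the predictor generalizes.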
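The hold-out check described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, and a straight-line fit with NumPy stands in for the real regressor, so none of the numbers relate to the gene dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: noisy linear relation between one feature and the score
x = rng.uniform(0, 1, size=200)
y = 0.2 + 0.6 * x + rng.normal(scale=0.05, size=200)

# Hold out 30% of the rows; they are never used for fitting
idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test_idx, train_idx = idx[:n_test], idx[n_test:]


def mse(y_true, y_hat):
    """Mean squared error between two 1-D arrays."""
    return float(np.mean((y_true - y_hat) ** 2))


# Fit on the training rows only (stand-in for the real regressor)
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)
mse_train = mse(y[train_idx], np.polyval(coeffs, x[train_idx]))
mse_test = mse(y[test_idx], np.polyval(coeffs, x[test_idx]))

print(f"train MSE={mse_train:.4f}, test MSE={mse_test:.4f}")
```

A test MSE that is much larger than the training MSE would suggest overfitting. With an xgboost regressor the same comparison applies, with the model's fit and predict calls replacing the polynomial fit.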