How to interpret the base_value of a GBT classifier when using SHAP?
I recently discovered this amazing library for ML interpretability. I decided to build a simple xgboost classifier on a toy dataset from sklearn and plot a force_plot.
To understand the plot, the library says:

"The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper)."
So it seemed to me that base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, that is not what the plot shows; the number there is not even within [0, 1]. I tried taking logs in different bases (10, e, 2), assuming it would be some kind of monotonic transformation... but still no luck. How can I get this base_value?
!pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
X,y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train,y_train)
print(clf.predict(X_train).mean())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value,shap_values[0,:],X_train.iloc[0,:])
Solution
To get the base_value in raw space (when link="identity"), you need to unwind class labels -> probabilities -> raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the probability:
import numpy as np
# probabilities
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
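As a cross-check (my addition, not part of the original answer): for the default deviance loss, sklearn's GradientBoostingClassifier.decision_function already returns these raw log-odds scores, so the manual inverse sigmoid above can be verified directly:

```python
import numpy as np
from scipy.special import logit
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# same setup as in the answer above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# decision_function gives raw log-odds; logit(predict_proba) should match it
raw_direct = clf.decision_function(X_train)
raw_manual = logit(clf.predict_proba(X_train)[:, 1])
print(np.allclose(raw_direct, raw_manual))

# its mean is the base value the answer recovers (~2.0659 with this seed)
print(raw_direct.mean())
```

This avoids shap entirely, which makes it a convenient sanity check that the "raw score" SHAP reports really is the model's log-odds output.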
The relevant plot for the 0th data point in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
Should you wish to switch to the sigmoid probability space (link="logit"):
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522
The relevant plot for the 0th data point in probability space:
Note that from SHAP's perspective, the probability base_value (what they would call a base value given no data available) is not the probability a reasonable person would assign in the absence of independent variables (0.6373626373626373 in this case).
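A small sketch (my addition, same setup as above) makes that distinction concrete: because the sigmoid is nonlinear, the sigmoid of the mean raw score is not the mean of the predicted probabilities, which is why SHAP's base probability differs from the naive class-balance figure:

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

p = clf.predict_proba(X_train)[:, 1]
# SHAP's base value in probability space: sigmoid of the MEAN raw score
base_shap = expit(logit(p).mean())
# the "common sense" baseline: mean predicted probability (close to class balance)
base_naive = p.mean()
# the answer reports ~0.8875 for the former vs ~0.637 for the latter
print(base_shap, base_naive)
```

In other words, averaging happens in log-odds space before the sigmoid is applied, so the two baselines disagree whenever the raw scores are not symmetric around zero.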
Full example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)
X,y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train,y_train.values.ravel())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf,model_output="raw")
shap_values = explainer.shap_values(X_train)
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
print("Expected raw score (before sigmoid):",y_raw)
print("Expected probability:",expit(y_raw))
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522