How to interpret the base_value of a GBT classifier when using SHAP?
I recently discovered this amazing library for ML interpretability. I decided to build a simple xgboost classifier on a toy dataset from sklearn and plot a force_plot.
To understand the plot, the library says:

"The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper)."
So it seemed to me that base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, that is not what the plot shows; the number there is not even within [0, 1]. I tried taking logs in different bases (10, e, 2), assuming it would be some kind of monotonic transformation... but still no luck. How can I get this base_value?
!pip install shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
X,y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train,y_train)
print(clf.predict(X_train).mean())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value,shap_values[0,:],X_train.iloc[0,:])
Solution
To get the base_value in raw space (when link="identity"), you need to unwind class labels -> probabilities -> raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the probability:
import numpy as np
# probabilities
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
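As a cross-check (my addition, not part of the original answer): for the default deviance loss, sklearn's GradientBoostingClassifier.decision_function already returns these raw log-odds scores, so the manual inverse sigmoid above can be verified directly:

```python
import numpy as np
from scipy.special import logit
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# same setup as in the answer above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# decision_function gives raw log-odds; logit(predict_proba) should match it
raw_direct = clf.decision_function(X_train)
raw_manual = logit(clf.predict_proba(X_train)[:, 1])
print(np.allclose(raw_direct, raw_manual))

# its mean is the base value the answer recovers (~2.0659 with this seed)
print(raw_direct.mean())
```

This avoids shap entirely, which makes it a convenient sanity check that the "raw score" SHAP reports really is the model's log-odds output.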
The relevant plot for the 0th data point in raw space:
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")
Should you wish to switch to the sigmoid probability space (link="logit"):
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522
The relevant plot for the 0th data point in probability space:
Note that from SHAP's perspective, the probability base_value (what they would call a base value given no data available) is not the probability a reasonable person would assign in the absence of independent variables (0.6373626373626373 in this case).
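A small sketch (my addition, same setup as above) makes that distinction concrete: because the sigmoid is nonlinear, the sigmoid of the mean raw score is not the mean of the predicted probabilities, which is why SHAP's base probability differs from the naive class-balance figure:

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

p = clf.predict_proba(X_train)[:, 1]
# SHAP's base value in probability space: sigmoid of the MEAN raw score
base_shap = expit(logit(p).mean())
# the "common sense" baseline: mean predicted probability (close to class balance)
base_naive = p.mean()
# the answer reports ~0.8875 for the former vs ~0.637 for the latter
print(base_shap, base_naive)
```

In other words, averaging happens in log-odds space before the sigmoid is applied, so the two baselines disagree whenever the raw scores are not symmetric around zero.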
Full example:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)
X,y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train,y_train.values.ravel())
# load JS visualization code to notebook
shap.initjs()
explainer = shap.TreeExplainer(clf,model_output="raw")
shap_values = explainer.shap_values(X_train)
from scipy.special import expit, logit
# probabilities
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
print("Expected raw score (before sigmoid):",y_raw)
print("Expected probability:",expit(y_raw))
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")
Output:
0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522