Why does LightGBM regression give all-zero SHAP values, and how can I fix it?
As you can see in the SHAP waterfall plot, all the values are zero. What causes this? Is a zero value reasonable?
Here is a link to my data: https://github.com/kilickursat/Tunnelling/blob/main/TBM_Performance.xlsx
Here is my code:
import numpy as np
import pandas as pd
import lightgbm
from sklearn.metrics import r2_score,mean_squared_error as MSE
from lightgbm import LGBMRegressor
import shap
import io
df2 = pd.read_excel(io.BytesIO(uploaded['TBM_Performance.xlsx'])) #Colab used
df2["ROCK_PRO"] = df2["UCS(MPa)"] / df2["BTS(MPa)"]
X = df2[["UCS(MPa)","BTS(MPa)","Fs(m)","Alpha(degree)","PI(kN/mm)","ROCK_PRO"]]
y = df2[["ROP(m/hr)"]]
print(df2)
print(X,y)
hyper_params = {
'task': 'train','boosting_type': 'goss','objective': 'regression','metric': "mse"
}
# train a LightGBM model
model = lightgbm.LGBMRegressor(**hyper_params).fit(X,y)
explainer = shap.Explainer(model)
shap_values = explainer(X)  # compute SHAP values before plotting
# visualize the first prediction's explanation
shap.plots.waterfall(shap_values[0])
(screenshot: SHAP waterfall plot in which every feature contribution is zero)
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
X = pd.DataFrame(np.c_[df2['PI(kN/mm)'],df2["ROCK_PRO"],df2["BTS(MPa)"]],columns = ['PI(kN/mm)',"ROCK_PRO","BTS(MPa)"])
y = df2['ROP(m/hr)']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)
model= LGBMRegressor(**hyper_params,min_data_in_leaf=0,min_sum_hessian_in_leaf=0.0).fit(X_train,y_train)
predictions = model.predict(X_test)
r2_score(y_test,predictions).round(2)  # note: r2_score expects (y_true, y_pred)
#R2_score : 0.96
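A side note on the scoring call above: scikit-learn's `r2_score` takes its arguments in the order `(y_true, y_pred)`, and R² is not symmetric, so swapping them generally changes the reported score. A minimal illustration with toy numbers (not the TBM data):

```python
from sklearn.metrics import r2_score

# Toy values only, to show that R^2 is not symmetric in its arguments
y_true = [3.0, 2.0, 4.0, 5.0]
y_pred = [2.8, 2.1, 3.9, 5.2]

print(r2_score(y_true, y_pred))  # correct order: ground truth first
print(r2_score(y_pred, y_true))  # swapped order: a different number
```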
Solution
The SHAP values are all zero because your model returns a constant prediction: every sample ends up in the same single leaf. Your dataset contains only 18 samples, while LightGBM by default requires at least 20 samples in a leaf (min_data_in_leaf defaults to 20), so no split is possible. If you set min_data_in_leaf to a smaller value such as 3, the model returns different predictions for different samples and the SHAP values are no longer zero.
import pandas as pd
from lightgbm import LGBMRegressor
import shap
# import the data
df = pd.read_excel('TBM_Performance.xlsx')
df['ROCK_PRO'] = df['UCS(MPa)'] / df['BTS(MPa)']
print(df.shape[0])
# 18
# extract the features and target
X = df[['UCS(MPa)','BTS(MPa)','Fs(m)','Alpha(degree)','PI(kN/mm)','ROCK_PRO']]
y = df[['ROP(m/hr)']]
# train the model with min_data_in_leaf=20
hyper_params = {
'task': 'train','boosting_type': 'goss','objective': 'regression','metric': 'mse',}
model = LGBMRegressor(**hyper_params).fit(X,y)
print(model.predict(X))
# [2.52277776 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776
# 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776
# 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776 2.52277776]
# train the model with min_data_in_leaf=3
hyper_params = {
'task': 'train','boosting_type': 'goss','objective': 'regression','metric': 'mse','min_data_in_leaf': 3,}
model = LGBMRegressor(**hyper_params).fit(X,y)
print(model.predict(X))
# [2.21428748 2.21428748 2.21428748 2.68171691 2.36794282 2.37986215
# 2.37986215 2.77942405 2.84938042 2.84938042 2.8104722 2.8104722
# 2.50056257 2.47946274 2.46754341 2.58446466 2.58446466 2.24212594]
explainer = shap.Explainer(model)
shap_values = explainer(X)
shap.plots.waterfall(shap_values[0])