How to interpret the base_value of a GBT classifier when using SHAP?

Question

G. *_*cia 4 python machine-learning scikit-learn shap

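I recently discovered this amazing library for ML interpretability. I decided to build a simple xgboost classifier using a toy dataset from sklearn and to draw a force_plot.

To understand the plot, the library says:

The above explanation shows features each contributing to push the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red, those pushing the prediction lower are in blue (these force plots are introduced in our Nature BME paper).

So it looks to me as if the base_value should be the same as clf.predict(X_train).mean(), which equals 0.637. However, this is not the case when looking at the plot; the number is not even within [0,1]. I tried taking the log in different bases (10, e, 2), assuming it would be some kind of monotonic transformation... but still no luck. How can I get to this base_value?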

!pip install shap

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)

print(clf.predict(X_train).mean())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_train)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X_train.iloc[0,:])

Answer 1

Ser*_*nov 6

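To get base_value in raw space (when link="identity"), you need to unwind class labels --> probabilities --> raw scores. Note that the default loss is "deviance", so the raw score is the inverse sigmoid (logit) of the probability: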

import numpy as np

# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# raw scores, default link="identity"
y_raw = np.log(y/(1-y))
# expected raw score
print(np.mean(y_raw))
print(np.isclose(explainer.expected_value, np.mean(y_raw), 1e-12))
2.065861773054686
[ True]
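
The probabilities --> raw scores step relies on the logit being the inverse of the sigmoid. A minimal standard-library sketch of that round trip; the helper names `sigmoid` and `logit` and the probability 0.8 are illustrative assumptions, not part of shap or sklearn:

```python
import math

def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability to a raw score (log-odds); inverse of sigmoid."""
    return math.log(p / (1.0 - p))

p = 0.8                 # hypothetical predicted probability
raw = logit(p)          # log(0.8 / 0.2) = log(4)
p_back = sigmoid(raw)   # round trip back to the probability

print(raw, p_back)
```

The same two functions are available vectorized as scipy.special.logit and scipy.special.expit.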

Relevant plot for the 0th data point in raw space:

shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="identity")

Should you wish to switch to sigmoid probability space (link="logit"):

from scipy.special import expit, logit
# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print(expit(y_raw))
0.8875405774316522

Relevant plot for the 0th data point in probability space:
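
Both plots rest on the same additive decomposition: the base value plus the per-feature SHAP values gives the raw model output. As far as I understand shap's link parameter, link="logit" only relabels the axis by passing those raw numbers through the sigmoid. A small sketch of that arithmetic, where the base value and contributions are made-up numbers rather than values from this model:

```python
import math

def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

base_value = 2.0                    # hypothetical expected raw score
contributions = [1.5, -0.5, 0.25]   # hypothetical per-feature SHAP values

# Raw space (link="identity"): contributions add up on the log-odds scale.
raw_output = base_value + sum(contributions)

# Probability space (link="logit"): the same points shown as probabilities.
base_prob = sigmoid(base_value)
prob_output = sigmoid(raw_output)

print(raw_output, base_prob, prob_output)
```

The contributions are additive only on the raw scale; in probability space the endpoints are transformed, not the individual SHAP values.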

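Note that the baseline probability from shap's perspective, i.e. what they call the base value when no feature information is available, is not what a reasonable person would define as the probability with no independent variables (0.6373626373626373 in this case). It is the sigmoid of the average raw score, and because the sigmoid is nonlinear, that generally differs from an average taken in probability or label space.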
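
The gap between the two numbers can be reproduced without any model. A minimal sketch, assuming three made-up predicted probabilities (they are not taken from the breast cancer data), that compares the plain average probability with the sigmoid of the average raw score:

```python
import math

def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability to a raw score (log-odds)."""
    return math.log(p / (1.0 - p))

probs = [0.999, 0.999, 0.001]   # hypothetical confident predictions

# Average taken directly in probability space.
mean_prob = sum(probs) / len(probs)

# Average taken in raw (log-odds) space, then mapped back.
mean_raw = sum(logit(p) for p in probs) / len(probs)
base_prob = sigmoid(mean_raw)

print(mean_prob, base_prob)
```

With confident predictions like these, the second number lands well above the first, which is the same direction as 0.8875 versus 0.637 in this answer.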

Full reproducible example:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
import shap
print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True)
X = pd.DataFrame(data=X)
y = pd.DataFrame(data=y)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train.values.ravel())

# load JS visualization code to notebook
shap.initjs()

explainer = shap.TreeExplainer(clf, model_output="raw")
shap_values = explainer.shap_values(X_train)

from scipy.special import expit, logit
# probabilities of the positive class
y = clf.predict_proba(X_train)[:,1]
# expected raw base value
y_raw = logit(y).mean()
# expected probability, i.e. base value in probability space
print("Expected raw score (before sigmoid):", y_raw)
print("Expected probability:", expit(y_raw))

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0,:], X_train.iloc[0,:], link="logit")

Output:

0.36.0
Expected raw score (before sigmoid): 2.065861773054686
Expected probability: 0.8875405774316522
