I used shap to determine the feature importance for a multiple regression with correlated features.
import numpy as np
import pandas as pd  
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import shap
boston = load_boston()
regr = pd.DataFrame(boston.data)
regr.columns = boston.feature_names
regr['MEDV'] = boston.target
X = regr.drop('MEDV', axis = 1)
Y = regr['MEDV']
fit = LinearRegression().fit(X, Y)
explainer = shap.LinearExplainer(fit, X, feature_dependence = 'independent')
# I used 'independent' because the result is consistent with the ordinary
# Shapley values, whereas 'correlated' is not
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type = 'bar') …

I am trying to draw some SHAP plots in Python to get a deeper understanding of the output of my machine learning model. This is the method I call in a for loop:
def plotAndSaveSHAPSummary(model,train_data,x_train,pathToSHAPPlots):
    shap_values = model.get_feature_importance(train_data, type='ShapValues')
    expected_value = shap_values[0,-1]
    shap_values = shap_values[:,:-1]
    shap.summary_plot(shap_values,x_train,max_display=20,show=False)
    plt.savefig(pathToSHAPPlots+'/SHAP Plots/SHAP_Plot'+str(counter)+'.png',dpi=300,bbox_inches='tight')
    plt.clf()
The plot is saved to disk as expected, but after each call to savefig I get the following error message:
Exception in Tkinter callback
Traceback (most recent call last):
  File "D:\PathTo\Anaconda\Lib\tkinter\__init__.py", line 1705, in __call__
    return self.func(*args)
  File "D:\PathTo\Anaconda\Lib\tkinter\__init__.py", line 749, in callit
    func(*args)
  File "D:\PathTo\Anaconda\lib\site-packages\matplotlib\backends\_backend_tk.py", line 270, in idle_draw
    self.draw()
  File "D:\PathTo\Anaconda\lib\site-packages\matplotlib\backends\backend_tkagg.py", line 9, in draw
    super(FigureCanvasTkAgg, self).draw()
  File "D:\PathTo\Anaconda\lib\site-packages\matplotlib\backends\backend_agg.py", line 393, in draw
    self.figure.draw(self.renderer)
  File "D:\PathTo\Anaconda\lib\site-packages\matplotlib\backend_bases.py", line 1535, in _draw
    def …

I have been trying to compute SHAP values for a gradient boosting classifier in the H2O module in Python. Below is an example adapted from the documentation of the predict_contributions method (adapted from https://github.com/h2oai/h2o-3/blob/master/h2o-py/demos/predict_contributionsShap.ipynb).
import h2o
import shap
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o import H2OFrame
# initialize H2O
h2o.init()
# load JS visualization code to notebook
shap.initjs()
# Import the prostate dataset
h2o_df = h2o.import_file("https://raw.github.com/h2oai/h2o/master/smalldata/logreg/prostate.csv")
# Split the data into Train/Test/Validation with Train having 70% and test and validation 15% each
train,test,valid = h2o_df.split_frame(ratios=[.7, .15])
# Convert the response column to a factor
h2o_df["CAPSULE"] = h2o_df["CAPSULE"].asfactor()
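# Generate a GBM model using …

In the code below, I import a saved sparse numpy matrix created in Python, densify it, and add masking, batchnorm, and a dense output layer to a many-to-one SimpleRNN. The Keras sequential model works fine; however, I cannot use shap with it. This runs in JupyterLab from WinPython 3830 on a Windows 10 desktop. The X matrix has shape (4754, 500, 64): 4754 examples with 500 time steps and 64 variables. I created a function to simulate the data so that the code can be tested. The simulated data returns the same error.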
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
import tensorflow.keras.backend as Kb
from tensorflow.keras import layers
from tensorflow.keras.layers import BatchNormalization
from tensorflow import keras as K
import numpy as np
import shap
import random
def create_x():
    dims = [10,500,64]
    data = []
    y = []
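    for i in range(dims[0]): …

Suppose we have a binary classification problem with two classes, 1 and 0, as our target. My goal is to use a tree classifier to predict 1 and 0 from the given features. I can also use SHAP values to rank the importance of the features for predicting 1 and 0. So far so good!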
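Now suppose I want to know the importance of the features for predicting only 1. What is the recommended approach? I could split my data into two parts (nominally: df_tot = df_zeros + df_ones) and use only df_ones in my classifier, then extract the SHAP values for that. But then the target would contain only 1s, so the model would not really learn to classify anything. So I wonder how to approach such a problem?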
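One possible reading of this question, sketched under assumptions: keep the classifier trained on both classes, take the SHAP values for the class-1 output, and aggregate them only over the rows of interest (for example, the rows predicted as 1). The helper `rank_features_for_rows` and the toy numbers below are hypothetical and written in plain Python; they only illustrate the aggregation step, and in practice the matrix would come from a tree explainer rather than being typed in by hand.

```python
def rank_features_for_rows(shap_values, row_mask, feature_names):
    """Rank features by mean |SHAP value| over the rows where row_mask is True."""
    selected = [row for row, keep in zip(shap_values, row_mask) if keep]
    if not selected:
        raise ValueError("row_mask selects no rows")
    mean_abs = [
        sum(abs(row[j]) for row in selected) / len(selected)
        for j in range(len(feature_names))
    ]
    # Sort feature indices from largest to smallest mean |SHAP value|.
    order = sorted(range(len(feature_names)), key=lambda j: -mean_abs[j])
    return [(feature_names[j], mean_abs[j]) for j in order]


# Hypothetical stand-in for SHAP values toward class 1 (4 rows, 3 features).
toy_shap = [
    [0.50, -0.10, 0.00],
    [0.40, 0.05, 0.10],
    [-0.05, -0.60, 0.02],
    [-0.10, -0.50, 0.01],
]
toy_pred = [1, 1, 0, 0]  # hypothetical model predictions for the same rows
toy_names = ["f0", "f1", "f2"]

# Importance restricted to the rows predicted as 1, and as 0 for comparison.
ranking_ones = rank_features_for_rows(toy_shap, [p == 1 for p in toy_pred], toy_names)
ranking_zeros = rank_features_for_rows(toy_shap, [p == 0 for p in toy_pred], toy_names)
print(ranking_ones)
print(ranking_zeros)
```

The model still sees both classes during training, so it learns a real decision boundary; only the summary of the explanations is restricted to a subset of rows. The same mask could be built from the true labels instead of the predictions, depending on which question is being asked.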
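
I am trying to create SHAP values for a single row for a local explanation, but I keep getting this error. I have tried various approaches but still cannot fix it.
What I have done so far:
Created an extra-trees model: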
from sklearn.ensemble import ExtraTreesRegressor
extra_tree = ExtraTreesRegressor(random_state=42)
extra_tree.fit(X_train, y_train)
Then I tried to compute the SHAP values:
# create an explainer object
explainer = shap.Explainer(extra_tree)    
explainer.expected_value
array([15981.25812347])
#calculate shap value for a single row
shap_values = explainer.shap_values(pd.DataFrame(X_train.iloc[9274]).T)
This gives me the following error:
Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. Consider retrying with the …

I want to do a simple SHAP analysis and plot a shap.force_plot. I noticed that it works locally in an .ipynb file without any problem, but on Databricks it fails with the following error message:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must 
also trust this notebook (File -> Trust notebook). If you are viewing  this notebook on 
github the Javascript has been stripped for security. If you are using JupyterLab this 
error is because a JupyterLab extension has not yet been written.
Code:
import xgboost
import shap 
shap.initjs()
X, y = shap.datasets.boston()
bst = …

For a particular prediction problem, I observed that a certain variable ranks high in the resulting XGBoost feature importance (based on gain), while it ranks quite low in the SHAP output.
How should this be interpreted? For example, is the variable highly important for our prediction problem or not?
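One mechanism that can produce such a disagreement, shown as a toy sketch rather than a diagnosis of this particular model: a squared-error gain grows with the square of an effect, while mean |SHAP| grows linearly with it, so a feature with a large effect on very few rows can win on gain and lose on mean |SHAP|. The sketch assumes an additive model made of two single-split stumps on independent features, in which case each feature's SHAP value for a row is simply its stump output minus the average stump output. All names and numbers are hypothetical, and real XGBoost gain is computed from gradient statistics, which this simplifies.

```python
def centered(values):
    """Subtract the mean, giving each row's contribution relative to the average."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def squared_error_gain(stump_outputs):
    """Total squared-error reduction achieved by a stump that fits these outputs."""
    return sum(c * c for c in centered(stump_outputs))


def mean_abs_contribution(stump_outputs):
    """Mean |contribution| per row, the quantity a SHAP bar plot averages."""
    contributions = centered(stump_outputs)
    return sum(abs(c) for c in contributions) / len(contributions)


# Feature "rare": shifts the prediction by 10 on only 2 of 100 rows.
rare = [10.0] * 2 + [0.0] * 98
# Feature "common": shifts the prediction by 1 on half of the rows.
common = [1.0] * 50 + [0.0] * 50

gain_rare, gain_common = squared_error_gain(rare), squared_error_gain(common)
shap_rare, shap_common = mean_abs_contribution(rare), mean_abs_contribution(common)

print("gain:      rare=%.2f common=%.2f" % (gain_rare, gain_common))
print("mean|SHAP|: rare=%.3f common=%.3f" % (shap_rare, shap_common))
```

In this toy case the rare feature has the larger gain but the smaller mean |SHAP|. Neither ranking is wrong; they answer different questions (how much the splits reduced training loss versus how much the feature moves a typical prediction).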

I want to create a SHAP plot of feature importance for a GBM model:
ctrlCV = trainControl(method = 'repeatedcv', repeats = 5 , number = 10 , classProbs = TRUE , savePredictions = TRUE, summaryFunction = twoClassSummary )
gbmFit = train(CR~., data = training_set,
               method = "gbm",
               metric="ROC",
               trControl = ctrlCV,
               tuneGrid = gbmGRID,
               verbose = FALSE)
However, all the examples I have found are for xgboost models, and packages such as SHAPforxgboost and shapr do not work for me. For example:
shap_values <- shap.values(xgb_model = gbmFit, X_train = training_set)
This produces an error:
error in `colnames<-`(`*tmp*`, value = c(colnames(x_train), "bias")) : attempt to set 'colnames' on an object with less than two …

For a binary classification problem: how can I get the SHAP contributions of the variables of a Ranger model?
Sample data:
library(ranger)
library(tidyverse)
# Binary Dataset
df <- iris
df$Target <- if_else(df$Species == "setosa",1,0)
df$Species <- NULL
# Train Ranger Model
model <- ranger(
  x = df %>%  select(-Target),
  y = df %>%  pull(Target))
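I have tried several libraries (DALEX, shapr, fastshap, shapper), but did not arrive at a solution.
I would like to get results like those SHAPforxgboost gives for xgboost: the output of shap.values, which is the SHAP contribution of the variables, and shap.plot.summary.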
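For orientation on what a "SHAP contribution" is when no model-specific package fits, here is a minimal model-agnostic sketch. It is written in Python rather than R and is not the API of any of the packages named above: it computes exact Shapley values by brute force for any black-box `predict(row)` function, filling in the features outside a coalition from a background sample. `toy_predict`, `x`, and `background` are hypothetical stand-ins for a fitted model's prediction function and its data. The cost grows exponentially with the number of features, so this is only practical for a handful of columns such as the four in iris.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, x, background):
    """Exact Shapley values of predict() for one row x.

    Features outside a coalition are replaced by the values of each
    background row in turn, and the predictions are averaged.
    """
    n = len(x)

    def value(coalition):
        total = 0.0
        for b in background:
            row = [x[j] if j in coalition else b[j] for j in range(n)]
            total += predict(row)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_predict(row):
    # Hypothetical black-box model: one additive term and one interaction.
    return 2.0 * row[0] + row[1] * row[2]


x = [3.0, 2.0, 4.0]
background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
phi = exact_shapley(toy_predict, x, background)
base_value = sum(toy_predict(b) for b in background) / len(background)
print(phi, base_value)
```

The contributions add up to the prediction for `x` minus the average prediction over the background rows, which is the property the summary plots rely on. For a classifier, `predict` would return the predicted probability of the class of interest.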