What is the correct way to obtain explanations for predictions using Shap?

Question

rad*_*lo 6 python machine-learning scikit-learn shap

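I'm new to shap, so I'm still trying to get my head around it. Basically, I have a simple sklearn.ensemble.RandomForestClassifier fitted with model.fit(X_train, y_train), and so on. After training, I'd like to obtain the Shap values to explain predictions on unseen data. Based on the docs and other tutorials, this seems to be the way to go: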

explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)

However, this takes a long time to run (about 18 hours for my data). If I replace model.predict with just model in the first line, i.e.:

explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)

the run time drops significantly (to about 40 minutes). So that leaves me wondering: what am I actually getting in the second case?

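To reiterate, I just want to be able to explain new predictions, and it seems strange to me that it would be this expensive, so I'm sure I'm doing something wrong.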

Answer 1

Ser*_*nov 10

I think your question already contains a hint:

explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)

is expensive, and is most likely some kind of exact algorithm that computes Shapley values from a function.

explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)

averages predictions that are readily available from the trained model.

To prove the first claim (the second is a fact), let's study the source code of the Explainer class.

Class definition:

class Explainer(Serializable):
    """ Uses Shapley values to explain any machine learning model or python function.
    This is the primary explainer interface for the SHAP library. It takes any combination
    of a model and masker and returns a callable subclass object that implements
    the particular estimation algorithm that was chosen.
    """

    def __init__(self, model, masker=None, link=links.identity, algorithm="auto", output_names=None, feature_names=None, linearize_link=True,
                 seed=None, **kwargs):
        """ Build a new explainer for the passed model.
        Parameters
        ----------
        model : object or function
            User supplied function or model object that takes a dataset of samples and
            computes the output of the model for those samples.

So now you know that you can provide either a model or a function as the first argument.

If a pandas DataFrame (or a 2-D NumPy array or sparse matrix) is supplied as the masker:

        if safe_isinstance(masker, "pandas.core.frame.DataFrame") or \
                ((safe_isinstance(masker, "numpy.ndarray") or sp.sparse.issparse(masker)) and len(masker.shape) == 2):
            if algorithm == "partition":
                self.masker = maskers.Partition(masker)
            else:
                self.masker = maskers.Independent(masker)

Finally, if a callable is supplied:

                elif callable(self.model):
                    if issubclass(type(self.masker), maskers.Independent):
                        if self.masker.shape[1] <= 10:
                            algorithm = "exact"
                        else:
                            algorithm = "permutation"

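Hopefully you can now see why the first one is slow: a bare callable is treated as a black box, so SHAP falls back to a model-agnostic algorithm (exact for 10 or fewer features, permutation otherwise) that has to call the function many times.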
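To make that cost concrete, here is a minimal pure-Python sketch of an exact Shapley computation for a black-box function. It is not SHAP's implementation: `exact_shapley`, `toy_model`, and the data are hypothetical names invented for illustration, and "absent" features are assumed to be filled in from background rows (the idea behind an independent masker). The point is only that the number of model calls grows as 2**n in the number of features.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, background):
    """Exact Shapley values of f at x; absent features take background values."""
    n = len(x)
    calls = 0

    def value(subset):
        # Average output when only the features in `subset` are fixed to x
        # and the remaining ones are taken from each background row.
        nonlocal calls
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += f(mixed)
            calls += 1
        return total / len(background)

    # Evaluate every coalition once: there are 2**n of them.
    cache = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            key = frozenset(subset)
            cache[key] = value(key)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                contrib += weight * (cache[s | {i}] - cache[s])
        phi.append(contrib)
    return phi, calls


def toy_model(row):
    # Hypothetical black-box model with an interaction term.
    return 2.0 * row[0] + row[1] * row[2]


background = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
x = [3.0, 2.0, 1.0]

phi, calls = exact_shapley(toy_model, x, background)
print(phi)
print(calls)  # 32 = 2**3 coalitions x 4 background rows
```

With 3 features and 4 background rows this already needs 32 model calls for a single explained row; every extra feature doubles that, which is why a structure-aware explainer is so much cheaper when one is available.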

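Now, to answer your questions:

What is the correct way to obtain explanations for predictions using Shap?

and

So that leaves me wondering: what am I actually getting in the second case?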

If you have a model that SHAP supports (tree, linear, etc.), use:

explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)

These are SHAP values extracted from the model itself, which is the reason SHAP came into existence.

If the model is not supported, use the first approach.

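Both should give similar results, as long as they explain the same model output. Note that for a classifier model.predict returns hard class labels, whereas the tree explainer works on the predicted class probabilities, so for a like-for-like comparison you would pass model.predict_proba to the first approach.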
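As a rough illustration of why a model-specific shortcut and a black-box estimate can agree, the sketch below compares the two on a linear model, where the SHAP value of feature i under independent features has the well-known closed form w_i * (x_i - mean_i). This is an analogy only: SHAP's tree explainer exploits tree structure with a different algorithm, and `permutation_shapley`, `linear_model`, and the data are hypothetical names made up for this example.

```python
import random


def permutation_shapley(f, x, background, n_permutations=50, seed=0):
    """Black-box estimate: average marginal contributions over random feature orders."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        for row in background:
            # Start from a background row and switch features to x one by one.
            current = list(row)
            prev = f(current)
            for i in order:
                current[i] = x[i]
                new = f(current)
                phi[i] += new - prev
                prev = new
    total = n_permutations * len(background)
    return [p / total for p in phi]


weights = [1.5, -2.0, 0.5]
bias = 0.3


def linear_model(row):
    # Hypothetical model whose structure we know.
    return bias + sum(w * v for w, v in zip(weights, row))


background = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
x = [3.0, 2.0, 1.0]

# Model-specific shortcut: no model calls at all, just the coefficients.
means = [sum(col) / len(col) for col in zip(*background)]
closed_form = [w * (xi - m) for w, xi, m in zip(weights, x, means)]

# Model-agnostic route: many calls to the function.
estimated = permutation_shapley(linear_model, x, background)

print(closed_form)
print(estimated)
```

The shortcut reads the answer straight from the model's parameters, while the black-box route reaches the same numbers only after many function evaluations, which mirrors the 40 minutes versus 18 hours gap in the question.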
