plot_partial_dependence() in scikit-learn incorrectly raises NotFittedError for properly fitted models (e.g. KerasRegressor or LGBMClassifier)

DrS*_*ich 5 python validation scikit-learn

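I am trying to create partial dependence plots with sklearn.inspection.plot_partial_dependence for a model that I built successfully using keras and the keras sklearn wrapper utilities (see the code block below). The wrapped model builds successfully, its fit method works, and after fitting its predict method returns the expected results. All signs indicate that it is a valid estimator. However, when I try to run plot_partial_dependence from sklearn.inspection, I get error text suggesting that it is not a valid estimator, even though I can demonstrate that it is.

I have edited this to use the sklearn Boston housing example data so that it is easier to reproduce.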

from sklearn.datasets import load_boston
from sklearn.inspection import plot_partial_dependence, partial_dependence
from keras.wrappers.scikit_learn import KerasRegressor
import keras
import tensorflow as tf
import pandas as pd

boston = load_boston()
feature_names = boston.feature_names
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
mean = X.describe().transpose()['mean']
std = X.describe().transpose()['std']
X_norm = (X-mean)/std

def build_model_small():
    model = keras.Sequential([
        keras.layers.Dense(64, activation='relu', input_shape=[len(X.keys())]),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(1)
        ])

    optimizer = keras.optimizers.RMSprop(0.0005)

    model.compile(loss='mse',
              optimizer=optimizer,
              metrics=['mae', 'mse', 'mape'])
    return model


kr = KerasRegressor(build_fn=build_model_small,verbose=0)
kr.fit(X_norm,y, epochs=100, validation_split = 0.2)
pdp_plot = plot_partial_dependence(kr,X_norm,feature_names)

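As I said, if I run kr.predict(X.head(20)), I get 20 predicted values of y for the first 20 rows of X, as one would expect from a valid estimator.

But the error text I get from plot_partial_dependence is as follows: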

Traceback (most recent call last):
  File "temp_ML_tf_sklearn_postproc.py", line 79, in <module>
    pdp_plot = plot_partial_dependence(kr,X,labels[:-1])
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 678, in plot_partial_dependence
    for fxs in features)
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/mymachine/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 307, in partial_dependence
    "'estimator' must be a fitted regressor or classifier."
ValueError: 'estimator' must be a fitted regressor or classifier.

I looked at the source code of plot_partial_dependence, and it contains the following. First, the docstring says that the first input, estimator, must be...

  A fitted estimator object implementing :term:`predict`,
    :term:`predict_proba`, or :term:`decision_function`.
    Multioutput-multiclass classifiers are not supported.

My estimator does implement .predict.

Second, the line named in the error traceback calls a check for whether the estimator is a regressor or a classifier:

if not (is_classifier(estimator) or is_regressor(estimator)):
    raise ValueError(
        "'estimator' must be a fitted regressor or classifier."
    )

I looked at the source code of is_regressor(), and it is a one-liner:

return getattr(estimator, "_estimator_type", None) == "regressor"

So I tried to hack around it with setattr(mp,'_estimator_type','regressor'), but that just gave AttributeError: can't set attribute, so that cheap workaround did not work.
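A minimal sketch of the check quoted above, and of one way such an assignment can be rejected. It reuses the one-line is_regressor test and applies it to two hypothetical stand-in classes; neither is a real scikit-learn or Keras class. ReadOnlyTypeWrapper exposes _estimator_type as a property without a setter. That is only an assumption about what might produce this error, not a confirmed description of the object called mp.

```python
def is_regressor(estimator):
    # Same one-line test as the scikit-learn source quoted above.
    return getattr(estimator, "_estimator_type", None) == "regressor"


class PlainWrapper:
    """Hypothetical wrapper that defines no _estimator_type at all."""

    def predict(self, X):
        return [0.0 for _ in X]


class ReadOnlyTypeWrapper:
    """Hypothetical wrapper exposing _estimator_type as a read-only property."""

    def __init__(self, inner):
        self.inner = inner

    @property
    def _estimator_type(self):
        # Delegates to the wrapped object; no setter is defined.
        return getattr(self.inner, "_estimator_type", None)


plain = PlainWrapper()
before = is_regressor(plain)          # attribute is missing, so the check fails
plain._estimator_type = "regressor"   # ordinary attribute: assignment works
after = is_regressor(plain)

read_only = ReadOnlyTypeWrapper(PlainWrapper())
try:
    setattr(read_only, "_estimator_type", "regressor")
    assignment_failed = False
except AttributeError:
    # A property without a setter rejects the assignment.
    assignment_failed = True

print(before, after, assignment_failed)
```

In this sketch the type check depends only on the attribute value, so whether the hack works depends on whether the object allows that attribute to be set.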

I even tried a hackier fix: I temporarily commented out the offending check in the _partial_dependence.py source (the if statement I copied above) and got the following error:

Traceback (most recent call last):
  File "temp_ML_tf_sklearn_postproc.py", line 79, in <module>
    pdp_plot = plot_partial_dependence(kr,X,labels[:-1])
  File "/home/billy/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 678, in plot_partial_dependence
    for fxs in features)
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/home/billy/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/home/billy/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py", line 317, in partial_dependence
    check_is_fitted(est)
  File "/home/billy/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 967, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This KerasRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

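This comes back to the same problem: the sklearn functions do not consider this model fitted when in fact it is. In any case, at that point I decided not to try any more risky fixes that involve modifying the installed source code.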
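One possible reading of this second error, sketched under two assumptions: that check_is_fitted in this scikit-learn version decides by looking for instance attributes whose names end in an underscore (the convention scikit-learn documents for fitted attributes), and that the Keras wrapper stores its trained network under a plain name such as model. The function looks_fitted and both classes are simplified, hypothetical stand-ins, not library code.

```python
def looks_fitted(estimator):
    # Simplified stand-in for a fitted check: an estimator counts as fitted
    # if it has at least one public instance attribute ending in "_".
    return any(
        name.endswith("_") and not name.startswith("_")
        for name in vars(estimator)
    )


class FakeKerasWrapper:
    """Hypothetical wrapper that keeps its trained model in `model`."""

    def fit(self, X, y):
        self.model = "trained network placeholder"
        return self


class ConventionalEstimator:
    """Hypothetical estimator following the trailing-underscore convention."""

    def fit(self, X, y):
        self.coef_ = [0.0] * len(X[0])
        return self


X = [[1.0, 2.0], [3.0, 4.0]]
y = [0.5, 1.5]

wrapper_fitted = looks_fitted(FakeKerasWrapper().fit(X, y))
conventional_fitted = looks_fitted(ConventionalEstimator().fit(X, y))

print(wrapper_fitted, conventional_fitted)
```

Under these assumptions a model can be fully trained and still fail the check, which would be consistent with the traceback above.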

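I also tried passing kr.fit(X,y,etc...) directly as the first argument of plot_partial_dependence. The computer churned for a few minutes, indicating that the fit was actually running, but when it then tried to produce the partial dependence plot I got the same error.

There is one more, rather confusing clue. I tried using the keras/sklearn wrapper in a different sklearn function, to see whether it works with any sklearn utility at all. This time I did: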

from sklearn.model_selection import cross_validate
cv_scores = cross_validate(kr, X_norm, y, cv=4, return_train_score=True, n_jobs=-1)

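It worked! So I don't think there is anything inherently wrong with my use of keras.wrappers.scikit_learn.KerasRegressor.

This may simply be an edge case that the plot_partial_dependence source code did not plan for, and I'm out of luck, but I would like to know whether anyone else has seen a problem like this and has a solution or workaround.

By the way, I am using sklearn 0.22.1 and Python 3.7.3 (Anaconda). To be clear, I have used plot_partial_dependence successfully on sklearn-built models and even pipelines. The problem only occurs with keras-based models. Any input would be greatly appreciated.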

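Edit:

A previous version of this question involved building a pipeline from StandardScaler() and the KerasRegressor wrapper object. I have since found that the problem occurs even with the KerasRegressor object alone, i.e. I have isolated the problem to that object rather than the pipeline. So, as one commenter suggested, I removed the pipeline part from the question to make it simpler and more to the point.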

DrS*_*ich 1

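I eventually found a cheap workaround, and it works for this particular case. It is not a very satisfying answer, though, and I can't guarantee that it works in all cases, so if anyone has a more general answer I would love to see it. I'm posting it here in case someone else needs to get around this problem.

I simply copied the source code (in my anaconda installation, at ~/anaconda3/lib/python3.7/site-packages/sklearn/inspection/_partial_dependence.py) into a file named custom_pdp.py in my project directory, with the offending parts commented out (and, where necessary, my own stand-in values hard-coded).

In my code, I used the import line import custom_pdp as cpdp instead of importing from sklearn, and then called plot_partial_dependence as cpdp.plot_partial_dependence(...).

Below are the lines I had to change in that source file. Note that you need to copy the entire source file, because other functions defined in it are also required; I only made the changes shown below. Also, this was done with sklearn 0.22.1, so it may not work with other versions.

First, you have to change the relative import lines at the top, as follows: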

from sklearn.utils.extmath import cartesian
from sklearn.utils import check_array
from sklearn.utils import check_matplotlib_support  # noqa
from sklearn.utils import _safe_indexing
from sklearn.utils import _determine_key_type
from sklearn.utils import _get_column_indices
from sklearn.utils.validation import check_is_fitted
from sklearn.tree._tree import DTYPE
from sklearn.exceptions import NotFittedError
from sklearn.ensemble._gb import BaseGradientBoosting
from sklearn.ensemble._hist_gradient_boosting.gradient_boosting import (
    BaseHistGradientBoosting)

(They were previously relative imports, e.g. from ..utils.extmath import cartesian, etc.)
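If you would rather not edit each import by hand, the conversion can be sketched with a small script. This is only an illustration: absolutize_imports is a hypothetical helper, it handles only the double-dot form (from ..x import y), and it assumes the copied file originally lived exactly one level below the top-level sklearn package, as sklearn/inspection/_partial_dependence.py does.

```python
import re


def absolutize_imports(source, package="sklearn"):
    """Rewrite `from ..x import y` lines as `from <package>.x import y`."""
    # Match "from .." at the start of a line when a module name follows.
    pattern = re.compile(r"^([ \t]*from[ \t]+)\.\.(?=\w)", flags=re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + package + ".", source)


original = (
    "from ..utils.extmath import cartesian\n"
    "from ..utils import check_array\n"
    "import numpy as np\n"
)
converted = absolutize_imports(original)
print(converted)
```

Lines that are already absolute, such as import numpy as np, are left unchanged. Any single-dot imports would still need to be fixed by hand.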

Then, the only functions that change are:

_partial_dependence_brute

def _partial_dependence_brute(est, grid, features, X, response_method):

    ... (skipping docstring)

    averaged_predictions = []

    # define the prediction_method (predict, predict_proba, decision_function).
    # if is_regressor(est):
    #     prediction_method = est.predict
    # else:
    #     predict_proba = getattr(est, 'predict_proba', None)
    #     decision_function = getattr(est, 'decision_function', None)
    #     if response_method == 'auto':
    #         # try predict_proba, then decision_function if it doesn't exist
    #         prediction_method = predict_proba or decision_function
    #     else:
    #         prediction_method = (predict_proba if response_method ==
    #                              'predict_proba' else decision_function)
    #     if prediction_method is None:
    #         if response_method == 'auto':
    #             raise ValueError(
    #                 'The estimator has no predict_proba and no '
    #                 'decision_function method.'
    #             )
    #         elif response_method == 'predict_proba':
    #             raise ValueError('The estimator has no predict_proba method.')
    #         else:
    #             raise ValueError(
    #                 'The estimator has no decision_function method.')
    prediction_method = est.predict

    #the rest in this function are as they were before, beginning with:
    for new_values in grid:
        X_eval = X.copy()

        ....
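For context on why hard-coding prediction_method = est.predict can be enough for a regressor: the brute-force approach only needs a callable that returns predictions. The following is a simplified pure-Python sketch of that idea, not the scikit-learn implementation; brute_partial_dependence and toy_predict are hypothetical names, and the toy model is invented for illustration.

```python
def brute_partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averaged_predictions = []
    for value in grid:
        # Copy every row and overwrite the chosen feature with the grid value.
        X_eval = [list(row) for row in X]
        for row in X_eval:
            row[feature] = value
        predictions = predict(X_eval)
        # Average over samples to get one value per grid point.
        averaged_predictions.append(sum(predictions) / len(predictions))
    return averaged_predictions


def toy_predict(X):
    # Hypothetical model: y = 2 * x0 + x1
    return [2 * row[0] + row[1] for row in X]


X = [[0.0, 1.0], [1.0, 3.0]]
pd_values = brute_partial_dependence(toy_predict, X, feature=0, grid=[0.0, 1.0])
print(pd_values)
```

Nothing in this loop depends on the estimator's declared type, only on the predict callable, which is why swapping in est.predict directly is plausible for a regression model.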

Then comment out the first 20 lines of the definition of partial_dependence:

def partial_dependence(estimator, X, features, response_method='auto',
                   percentiles=(0.05, 0.95), grid_resolution=100,
                   method='auto'):
    ... (skipping docstring)
    # if not (is_classifier(estimator) or is_regressor(estimator)):
    #     raise ValueError(
    #         "'estimator' must be a fitted regressor or classifier."
    #     )
    # 
    # if isinstance(estimator, Pipeline):
    #     # TODO: to be removed if/when pipeline get a `steps_` attributes
    #     # assuming Pipeline is the only estimator that does not store a new
    #     # attribute
    #     for est in estimator:
    #         # FIXME: remove the None option when it will be deprecated
    #         if est not in (None, 'drop'):
    #             check_is_fitted(est)
    # else:
    #     check_is_fitted(estimator)
    # 
    # if (is_classifier(estimator) and
    #         isinstance(estimator.classes_[0], np.ndarray)):
    #     raise ValueError(
    #         'Multiclass-multioutput estimators are not supported'
    #     )

    #The rest of the function continues as it was:
    # Use check_array only on lists and other non-array-likes / sparse. Do not
    # convert DataFrame into a NumPy array.
    if not(hasattr(X, '__array__') or sparse.issparse(X)):
        X = check_array(X, force_all_finite='allow-nan', dtype=np.object)

        ....

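If your model is of a different type, or you use different parameters, you may need to make other changes.

On my model, it works exactly as I hoped. But as I said, this is a workaround and not the most satisfying solution. Your mileage may also vary considerably depending on the type of model or the parameters you are trying to use.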