Is it possible to use statsmodels estimation with scikit-learn cross-validation?

CAR*_*man 13 python scikit-learn cross-validation statsmodels

I originally posted this question on the Cross Validated forum, but later realized it would probably find a more suitable audience on Stack Overflow.

I am looking for a way to feed a fitted model (results) object from Python's statsmodels into the cross_val_score method of scikit-learn's cross_validation module. The attached link suggests it may be possible, but I have not succeeded.

I get the following error:

estimator should be an estimator implementing 'fit' method, <statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x7fa6e801c590> was passed

See this link.

Dav*_*ale 20

Indeed, you cannot use cross_val_score directly on statsmodels objects, because of their different interface: in statsmodels

  • training data is passed directly into the constructor
  • a separate object contains the result of model estimation

However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMWrapper(BaseEstimator, RegressorMixin):
    """ A universal sklearn-style wrapper for statsmodels regressors """
    def __init__(self, model_class, fit_intercept=True):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        self.model_ = self.model_class(y, X)
        self.results_ = self.model_.fit()
        return self  # sklearn convention: fit returns the fitted estimator
    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)

This class implements the proper fit and predict methods and can be used with sklearn, for example for cross-validation or for inclusion in a pipeline. Like this:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=1, n_samples=300, noise=100)

print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
print(cross_val_score(LinearRegression(), X, y, scoring='r2'))

You can see that the outputs of the two models are identical, because they are both OLS models, cross-validated in exactly the same way.

[0.28592315 0.37367557 0.47972639]
[0.28592315 0.37367557 0.47972639]


And*_*ldo 7

Following the suggestion of David (which gave me an error, complaining about a missing function get_parameters) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as sklearn.linear_model.LinearRegression, but in addition it also has the function summary(), which gives information about p-values, R2 and other statistics, as in statsmodels.OLS.

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
import pandas as pd
import numpy as np

from sklearn.utils.validation import check_X_y, check_is_fitted, check_array



class MyLinearRegression(BaseEstimator, RegressorMixin):
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y, column_names=()):
        """
        Parameters
        ----------
        column_names: list
            Optional; the feature name to associate with each
            column of X. Useful together with the method summary(),
            so that it can show the feature name for each coefficient.
        """

        if self.fit_intercept:
            X = sm.add_constant(X)

        # Check that X and y have correct shape
        X, y = check_X_y(X, y)


        self.X_ = X
        self.y_ = y

        if len(column_names) != 0:
            # accept tuples and lists; prepend a name for the constant column
            cols = list(column_names)
            cols.insert(0, 'intercept')
            X = pd.DataFrame(X)
            X.columns = cols

        self.model_ = sm.OLS(y, X)
        self.results_ = self.model_.fit()
        return self



    def predict(self, X):
        # Check is fit had been called
        check_is_fitted(self, 'model_')

        # Input validation
        X = check_array(X)

        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)


    def get_params(self, deep = False):
        return {'fit_intercept':self.fit_intercept}


    def summary(self):
        print(self.results_.summary() )

Usage example:

cols = ['feature1','feature2']
X_train = df_train[cols].values
X_test = df_test[cols].values
y_train = df_train['label']
y_test = df_test['label']
model = MyLinearRegression()
model.fit(X_train, y_train)
model.summary()
model.predict(X_test)

If you want to display the names of the columns, you can call:

model.fit(X_train, y_train, column_names=cols)

To use it in cross-validation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
scores
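One thing worth noting about the snippet above: neg_mean_squared_error scores come back negated, because sklearn scorers are always maximized. A sketch of turning them into RMSE values, with sklearn's own LinearRegression standing in for the wrapper:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(random_state=1, n_samples=200, noise=10)
scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring='neg_mean_squared_error')
rmse = np.sqrt(-scores)  # flip the sign before taking the root
print(rmse.mean())
```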

  • Regarding the last snippet, "To use it in cross-validation": why do you use X_train and y_train in cross_val_score instead of just X and y? (2 upvotes)
  • Because I follow this protocol: (i) split the samples into a training set and a test set; (ii) select the best model, i.e. the one giving the highest cross-validation score, using only the training set, to avoid any data leakage; (iii) check the performance of that model on the "unseen" data contained in the test set. If you used the whole dataset for cross-validation, you would be selecting the model based on the same data on which you then judge it. Technically, that would be data leakage; indeed, it would not tell you how your model handles completely unseen data. (2 upvotes)
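The protocol described in that reply can be sketched end to end (synthetic data; the candidate models here are arbitrary illustrations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(random_state=1, n_samples=300, noise=50)

# (i) split off a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# (ii) model selection with cross-validation on the training set only
candidates = {'ols': LinearRegression(), 'ridge': Ridge(alpha=10.0)}
cv_means = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
            for name, model in candidates.items()}
best_name = max(cv_means, key=cv_means.get)

# (iii) a single final evaluation on the untouched test set
best_model = candidates[best_name].fit(X_train, y_train)
test_r2 = best_model.score(X_test, y_test)
print(best_name, test_r2)
```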

cyl*_*lim 6

For reference, if you use the statsmodels formula API and/or use the fit_regularized method, you can modify @David Dale's wrapper class in this way.

import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from statsmodels.formula.api import glm as glm_sm

# This is an example wrapper for statsmodels GLM
class SMWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, family, formula, alpha, L1_wt):
        self.family = family
        self.formula = formula
        self.alpha = alpha
        self.L1_wt = L1_wt
        self.model = None
        self.result = None
    def fit(self, X, y):
        data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
        data.columns = X.columns.tolist() + ['y']
        self.model = glm_sm(self.formula, data, family=self.family)
        self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
        return self  # sklearn expects fit to return the estimator itself, not the results
    def predict(self, X):
        return self.result.predict(X)