Grid search for ridge regression with a pipeline

cep*_*pel 5 python regression machine-learning grid-search

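I am trying to optimize the hyperparameters of a ridge regression, and I am also adding polynomial features. The pipeline itself works fine, but I get an error when I try GridSearchCV. Here is the code: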

# Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from collections import Counter
from IPython.core.display import display, HTML
sns.set_style('darkgrid')

# Data Preprocessing 
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target

# X and y Variables
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 25)

# Building the Model ------------------------------------------------------------------------

# Fitting the regressor to the Training set
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge())
]

ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
# Predicting the Test set results
y_pred = ridge_pipe.predict(X_test)

# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = ridge_pipe, X = X_train, y = y_train, cv = 10)
accuracies.mean()
#accuracies.std()

# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV

parameters = [ {'alpha': np.arange(0, 0.2, 0.01) } ]

grid_search = GridSearchCV(estimator = ridge_pipe, 
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)  # <-- GETTING ERROR IN HERE

Error:

ValueError: Invalid parameter ridge for estimator

What should I do, or is there a better way to use ridge regression with a pipeline? I would also appreciate some resources on grid search, since I am new to this.

Moh*_*hif 6

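There are two problems in your code. First, since you are using a Pipeline, you need to specify in the params list which part of the pipeline each parameter belongs to. See the official documentation for more information: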


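"The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__', as in the example below."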


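In this case, since alpha is a parameter of the ridge regression and you have named that step model in the Pipeline definition, you need to rename the key alpha to model__alpha (with two underscores):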

steps = [
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge())  # <------ Whatever string you assign here will be used later
]

# Since you have named the step 'model', the key must be 'model__alpha' (two underscores)
parameters = [ {'model__alpha': np.arange(0, 0.2, 0.01) } ]
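If you are ever unsure which keys a pipeline accepts, you can ask it directly. The sketch below rebuilds the same three-step pipeline and lists its parameter names with get_params(); the values passed to set_params are arbitrary and only there for illustration.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

ridge_pipe = Pipeline([
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge()),
])

# Nested parameter names have the form <step name>__<parameter name>.
param_names = sorted(ridge_pipe.get_params().keys())
print([name for name in param_names if name.startswith('model__')])

# set_params accepts the same names; a bare 'alpha' would raise a ValueError.
# The values 0.1 and 3 are illustrative only.
ridge_pipe.set_params(model__alpha=0.1, poly__degree=3)
```

Any key printed there can be used in param_grid, so poly__degree could be tuned in the same search as model__alpha.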

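Next, you need to understand that this dataset is for regression. You should not use accuracy here; use a regression scoring function such as mean_squared_error instead. scikit-learn provides other regression metrics that you can use as well. Something like this: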

from sklearn.metrics import mean_squared_error, make_scorer
# greater_is_better=False negates the MSE, so GridSearchCV (which maximizes the score) picks the lowest error
scoring_func = make_scorer(mean_squared_error, greater_is_better=False)

grid_search = GridSearchCV(estimator = ridge_pipe, 
                           param_grid = parameters,
                           scoring = scoring_func,  #<--- Use the scoring func defined above
                           cv = 10,
                           n_jobs = -1)
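One detail worth checking when you wrap a loss with make_scorer: GridSearchCV treats larger scores as better, so a loss such as MSE should be negated, either with greater_is_better=False or with the built-in string 'neg_mean_squared_error'. Here is a small sketch of the sign convention on made-up data (the arrays and the alpha value are assumptions for illustration only):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_squared_error

# Toy regression data, generated only to have something to score.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = Ridge(alpha=1.0).fit(X, y)
mse = mean_squared_error(y, model.predict(X))

# With greater_is_better=False the scorer returns the negated loss.
loss_scorer = make_scorer(mean_squared_error, greater_is_better=False)
score = loss_scorer(model, X, y)

print(mse, score)  # score is the negative of mse
```

That is why best_score_ comes out negative when you search on MSE; flip the sign to read it as an error.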

Here is a link to a Google Colab notebook with the working code.
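For reference, here is a self-contained sketch of both fixes together. It is not the notebook's code: it swaps in synthetic data from make_regression, because load_boston has been removed from recent scikit-learn releases, and the alpha grid and cv=5 are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): 13 features, like the original dataset.
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=25)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=25)

ridge_pipe = Pipeline([
    ('scalar', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2)),
    ('model', Ridge()),
])

# Keys are prefixed with the step name; the grid itself is illustrative.
parameters = {'model__alpha': np.logspace(-3, 2, 6)}

grid_search = GridSearchCV(estimator=ridge_pipe,
                           param_grid=parameters,
                           scoring='neg_mean_squared_error',  # regression metric
                           cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(-grid_search.best_score_)  # mean cross-validated MSE of the best alpha

# With a scoring string set, score() reports the same metric on held-out data.
test_neg_mse = grid_search.score(X_test, y_test)
```

After fitting, grid_search.best_estimator_ is the whole pipeline refit on the training set with the selected alpha, so you can call predict on it directly.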
