Explicitly specifying test/train sets in GridSearchCV

Bra*_*mon 5 python scikit-learn grid-search

I have a question about the cv parameter of sklearn's GridSearchCV.

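I'm working with data that has a time component, so random shuffling, as in KFold cross-validation, doesn't seem sensible.

Instead, I want to explicitly specify the cutoffs for the training, validation, and test data within GridSearchCV. Can I do this?

To better illustrate the question, here is how I would do it manually.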

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
np.random.seed(444)

index = pd.date_range('2014', periods=60, freq='M')
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X = pd.DataFrame(X, index=index, columns=list('abc'))
y = pd.Series(y, index=index, name='y')

# Train on the first 35 samples, validate on the next 15, test on
#     the final 10.
X_train, X_val, X_test = np.array_split(X, [35, 50])
y_train, y_val, y_test = np.array_split(y, [35, 50])

param_grid = {'alpha': np.linspace(0, 1, 11)}
model = None
best_param_ = None
best_score_ = -np.inf

# Manual implementation
for alpha in param_grid['alpha']:
    ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
    score = ridge.score(X_val, y_val)
    if score > best_score_:
        best_score_ = score
        best_param_ = alpha
        model = ridge

print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
print('Best score (on validation data): {:0.2f}'.format(best_score_))
print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
# Optimal alpha parameter: 1.00
# Best score (on validation data): 0.64
# Test set score: 0.22

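The process here is:

  • For both X and y, I want a training set, a validation set, and a test set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10.
  • The training and validation sets are used to determine the optimal alpha parameter for Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
  • The test set is held out for the "actual" test, as unseen data.

Anyway... it seems I'm looking to do something like this, but I'm not sure what to pass to cv here: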

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
grid_search.fit(...?)

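The docs, which I'm having trouble interpreting, specify:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross validation,
  • integer, to specify the number of folds in a (Stratified)KFold,
  • An object to be used as a cross-validation generator.
  • An iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.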
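
The last option in that list, "an iterable yielding train, test splits", can be read quite literally: a list of (train_indices, test_indices) pairs. The following is a minimal sketch under that reading, not a definitive recipe. The names `X_dev`, `y_dev`, and `custom_cv` are made up for illustration, and plain NumPy arrays are used instead of the DataFrame from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)

# Keep the final 10 samples out of the search entirely.
X_dev, y_dev = X[:50], y[:50]

# One (train_indices, test_indices) pair:
# fit on rows 0..34, validate on rows 35..49.
custom_cv = [(np.arange(35), np.arange(35, 50))]

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=custom_cv)
grid_search.fit(X_dev, y_dev)

print(grid_search.best_params_)
```

Because the list holds a single pair, every alpha is scored on exactly one fixed validation block and nothing is shuffled.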

Viv*_*mar 10

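As @MaxU said, it's better to let GridSearchCV handle the splits, but if you want to enforce the split as you have set it up in the question, you can use PredefinedSplit, which does exactly this.

So you need to make the following changes to your code.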

# Here X_test, y_test is the untouched data
# Validation data (X_val, y_val) is currently inside X_train, which will be split using PredefinedSplit inside GridSearchCV
X_train, X_test = np.array_split(X, [50])
y_train, y_test = np.array_split(y, [50])


# The indices which have the value -1 will be kept in train.
train_indices = np.full((35,), -1, dtype=int)

# The indices which have zero or positive values, will be kept in test
test_indices = np.full((15,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)

print(test_fold)
# OUTPUT:
# array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
#        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
#        -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
ps.get_n_splits()
# OUTPUT: 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)

# OUTPUT:
# ('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
#    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
#    34]),
#  'TEST:', array([35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]))


# And now, send this `ps` to cv param in GridSearchCV
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

# Here, send the X_train and y_train
grid_search.fit(X_train, y_train)

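The X_train and y_train passed to fit() will be split into train and test (val in your case) using the split we defined, so Ridge will be trained on the original data from indices [0:35] and tested on [35:50].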
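
To close the loop with the manual version in the question, here is a hedged sketch of reading the results back and scoring the held-out rows. It rebuilds the data as plain NumPy arrays so that it runs on its own. One detail worth knowing: with the default refit=True, GridSearchCV refits the best estimator on everything passed to fit(), meaning all 50 rows (training plus validation), so the test score need not match the manual loop, which kept the model trained on 35 rows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X_train, X_test = X[:50], X[50:]
y_train, y_test = y[:50], y[50:]

# -1 marks rows that are always in training; 0 marks validation fold 0.
test_fold = np.concatenate([np.full(35, -1), np.zeros(15, dtype=int)])
ps = PredefinedSplit(test_fold)

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)
grid_search.fit(X_train, y_train)

print('Optimal alpha parameter:', grid_search.best_params_['alpha'])
print('Best score (on validation data):', grid_search.best_score_)
# best_estimator_ was refit on all 50 rows before this score is computed.
test_score = grid_search.score(X_test, y_test)
print('Test set score:', test_score)
```

If the model fitted on the 35 training rows only is what you want, passing refit=False and fitting Ridge yourself with the selected alpha is one option.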

Hope this clears up how it works.

  • You could also do `test_fold = np.repeat([-1,0],[35,15])` here to save a few lines (2 upvotes)
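
A quick check, offered as a sketch, that the np.repeat one-liner builds the same fold array as the np.full / np.append version and yields the same single split. The names `long_form`, `short_form`, and `same` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Fold array built the long way, as in the answer.
long_form = np.append(np.full((35,), -1, dtype=int),
                      np.full((15,), 0, dtype=int))

# Fold array built with the one-liner from the comment.
short_form = np.repeat([-1, 0], [35, 15])

same = np.array_equal(long_form, short_form)

# PredefinedSplit yields one (train, test) pair for this fold array.
train_index, test_index = next(PredefinedSplit(short_form).split())
print(same, train_index[[0, -1]], test_index[[0, -1]])
```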

Ber*_*man 6

Have you tried TimeSeriesSplit?

It is made explicitly for splitting time series data.

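from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)
# clf is any estimator, e.g. Ridge(); the splitter object can be passed directly.
grid_search = GridSearchCV(clf, param_grid, cv=tscv)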
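
To see what TimeSeriesSplit would actually do with 60 samples, the sketch below lists the index ranges of each fold. The zero-filled `X` is only a placeholder, since the splitter needs just the number of rows. Note that this is a different scheme from the single fixed split in the question: it produces several folds with a growing training window, and it splits whatever is passed in, so a final test set still has to be held out beforehand.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((60, 3))  # placeholder data; only the row count matters here

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_index, test_index in splits:
    # Each training window ends before its test window begins.
    print('TRAIN: {}..{}  TEST: {}..{}'.format(
        train_index[0], train_index[-1], test_index[0], test_index[-1]))
```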