pir*_*pir 27 python validation scikit-learn cross-validation
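I have a dataset that was previously split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance of different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to explicitly pass the validation set to sklearn.grid_search.GridSearchCV(). Below is some code I previously used to do K-fold cross-validation on the training set. For this problem, though, I need to use the given validation set. How can I do that?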
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV
# (some code left out to simplify things)
skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle = True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
                           class_weight=penalty_weights),
                   param_grid=tuned_parameters,
                   n_jobs=2,
                   pre_dispatch="n_jobs",
                   cv=skf,
                   scoring=scorer)
clf.fit(X_train, y_train)
yan*_*jie 30
from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold=your_test_fold)
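Then set cv=ps in GridSearchCV.
test_fold: "array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set fold, but will instead always be put into the training fold."
See also here.
When using a validation set, set test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.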
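To make the -1 / 0 convention concrete, here is a small sketch that builds a test_fold array and prints the split PredefinedSplit produces from it. The sample counts are made up for illustration, and it assumes the training rows come first, followed by the validation rows.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical sizes: 6 training samples followed by 3 validation samples.
n_train, n_val = 6, 3

# -1 = always in the training part; 0 = member of the single validation fold.
test_fold = np.concatenate([np.full(n_train, -1), np.zeros(n_val, dtype=int)])

ps = PredefinedSplit(test_fold=test_fold)
n_splits = ps.get_n_splits()

# Collect the (train indices, validation indices) pairs the splitter yields.
splits = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in ps.split()]
for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```

Because only one fold label (0) is used, the splitter yields a single train/validation pair, so GridSearchCV scores every parameter combination on the validation rows only.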
小智 12
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit
# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y,random_state = 2020)
# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]
# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
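One detail about the clf.fit(X, y) call above: with GridSearchCV's default refit=True, the best parameter combination is retrained on everything passed to fit, so the final model sees training and validation rows together. If the final model should only see the training set, one possible approach is to pass refit=False and fit the winner manually. The sketch below uses synthetic data, and SVC and the small C grid are placeholders, not part of the answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# Synthetic stand-ins for a pre-split dataset (assumption for illustration).
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
X_train, y_train = X[:90], y[:90]
X_val, y_val = X[90:], y[90:]

# Stack train and validation rows, marking validation rows as fold 0.
X_all = np.concatenate([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
test_fold = [-1] * len(X_train) + [0] * len(X_val)

search = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1, 10]},
    cv=PredefinedSplit(test_fold),
    refit=False,  # skip the automatic retraining on train + validation
)
search.fit(X_all, y_all)

# Retrain the winning configuration on the training rows only.
final_model = SVC(**search.best_params_).fit(X_train, y_train)
print(search.best_params_, final_model.score(X_val, y_val))
```

Whether to retrain on train + validation or on train alone depends on how the comparison between algorithms is defined; both are common.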
cgn*_*utt 11
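Consider using the hypopt Python package (pip install hypopt), of which I am the author. It is a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out of the box, and can also be used with Tensorflow, PyTorch, Caffe2, etc.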
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR
param_grid = [
{'C': [1, 10, 100], 'kernel': ['linear']},
{'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
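EDIT: I (think I) received a -1 on this response because I am suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.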
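For readers who want to see the underlying idea without an extra dependency, a validation-set grid search can also be written as a plain loop using scikit-learn's ParameterGrid. This is only a sketch of the general technique and is not hypopt's implementation; the synthetic data and the train/validation slicing are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVR

# Synthetic stand-ins for pre-split data (assumption for illustration).
X, y = make_regression(n_samples=150, n_features=4, noise=0.1, random_state=0)
X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]

param_grid = [
    {"C": [1, 10, 100], "kernel": ["linear"]},
    {"C": [1, 10, 100], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

# Fit each combination on the training set and score it on the validation set.
best_score, best_params = float("-inf"), None
for params in ParameterGrid(param_grid):
    model = SVR(**params).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # R^2 on the validation set
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

The loop never touches a test set, so a held-out test set stays available for the final comparison between algorithms.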
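To complement @Vinubalan's answer: when the train/validation/test split was not made with scikit-learn's train_test_split() function, i.e. the dataframes were already split manually beforehand and scaled/normalized in a way that prevents leakage from the training data, the numpy arrays can simply be concatenated.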
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold = split_index)
clf = GridSearchCV(estimator=estimator,
                   cv=pds,
                   param_grid=param_grid)
# Fit with all data
clf.fit(X, y)