Custom cross-validation split in sklearn

use*_*210 2 python validation machine-learning scikit-learn cross-validation

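I am trying to split a dataset for cross-validation and GridSearch in sklearn. I want to define my own split, but GridSearch only seems to accept the built-in cross-validation methods.

However, I can't use the built-in cross-validation methods, because I need certain groups of examples to stay in the same fold. So, if I have the examples: [A1, A2, A3, A4, A5, B1, B2, B3, C1, C2, C3, C4, ..., Z1, Z2, Z3]

I want to perform cross-validation such that the examples from each group [A, B, C, ...] appear in only one fold.

That is, K1 contains [D, E, G, J, K, ...], K2 contains [A, C, L, M, ...], K3 contains [B, F, I, ...], and so on.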

eic*_*erg 12

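This kind of thing can usually be done with sklearn.cross_validation.LeaveOneLabelOut (available in scikit-learn versions before 0.20, where the sklearn.cross_validation module still exists). You just need to construct a label vector that encodes your groups. That is, all samples in K1 take label 1, all samples in K2 take label 2, and so on.

Here is a fully runnable example on fake data. The important lines are the one that creates the cv object and the call to cross_val_score.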

import numpy as np

n_features = 10

# Make some data
A = np.random.randn(3, n_features)
B = np.random.randn(5, n_features)
C = np.random.randn(4, n_features)
D = np.random.randn(7, n_features)
E = np.random.randn(9, n_features)

# Group it
K1 = np.concatenate([A, B])
K2 = np.concatenate([C, D])
K3 = E

data = np.concatenate([K1, K2, K3])

# Make some dummy prediction target
target = np.random.randn(len(data)) > 0

# Make the corresponding labels
labels = np.concatenate([[i] * len(K) for i, K in enumerate([K1, K2, K3])])

from sklearn.cross_validation import LeaveOneLabelOut, cross_val_score

cv = LeaveOneLabelOut(labels)

# Use some classifier in crossvalidation on data
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
scores = cross_val_score(lr, data, target, cv=cv)
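For newer scikit-learn releases, where the splitters live in sklearn.model_selection and LeaveOneLabelOut is named LeaveOneGroupOut, a minimal sketch of the same idea could look like the following. The splitter no longer takes the labels in its constructor; they are passed as groups instead. The seeded data, the alternating dummy target and the names fold_sizes, groups and splits are assumptions made for this illustration, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.RandomState(0)  # seeded only to make the sketch reproducible
n_features = 10

# Three groups with the same sizes as K1, K2, K3 above (3+5, 4+7, 9 samples)
fold_sizes = [8, 11, 9]
data = rng.randn(sum(fold_sizes), n_features)

# Alternating dummy target, so every training set contains both classes
target = np.arange(len(data)) % 2

# One group id per sample: 0 for the first 8 rows, 1 for the next 11, 2 for the rest
groups = np.repeat(np.arange(len(fold_sizes)), fold_sizes)

cv = LeaveOneGroupOut()

# The explicit (train, test) index pairs, one per group
splits = list(cv.split(data, target, groups))

# The groups are handed to cross_val_score rather than to the splitter
scores = cross_val_score(LogisticRegression(), data, target, groups=groups, cv=cv)
```

Each entry of splits holds one whole group out as the test set, so scores has one value per group.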

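However, you may run into situations where you want to define the folds entirely by hand. In that case you need to create an iterable (e.g. a list) of (train, test) pairs that indicate, by index, which samples go into the training set and which into the test set of each fold. Let's check this: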

# create train and test folds from our labels:
cv_by_hand = [(np.where(labels != label)[0], np.where(labels == label)[0])
               for label in np.unique(labels)]

# We check this against our existing cv by converting the latter to a list
cv_to_list = list(cv)

print(cv_by_hand)
print(cv_to_list)

# Check equality
for (train1, test1), (train2, test2) in zip(cv_by_hand, cv_to_list):
    assert (train1 == train2).all() and (test1 == test2).all()

# Use the created cv_by_hand in cross validation
scores2 = cross_val_score(lr, data, target, cv=cv_by_hand)


# assert equality again
assert (scores == scores2).all()
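Since the question is about GridSearch with several groups per fold, here is a hedged sketch of how this might look with the newer sklearn.model_selection API. It assumes a recent scikit-learn, uses GroupKFold to keep every group inside a single fold, and passes the folds to GridSearchCV either as a precomputed list of (train, test) index pairs or as a splitter plus groups. The sample names, the parameter grid and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.RandomState(0)  # seeded only to make the sketch reproducible

# Hypothetical sample names in the style of the question: A1..A4, B1..B4, ..., F1..F4
names = ["%s%d" % (letter, i) for letter in "ABCDEF" for i in range(1, 5)]
X = rng.randn(len(names), 10)
y = np.arange(len(names)) % 2  # alternating dummy binary target

# The group of each sample is the leading letter of its name
groups = np.array([name[0] for name in names])

# Three folds; all samples that share a letter end up in the same test fold
cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups))

param_grid = {"C": [0.1, 1.0, 10.0]}  # illustrative grid

# Option 1: hand GridSearchCV the precomputed (train, test) index pairs
search = GridSearchCV(LogisticRegression(), param_grid, cv=folds)
search.fit(X, y)

# Option 2: hand it the splitter and pass the groups to fit
search_groups = GridSearchCV(LogisticRegression(), param_grid, cv=cv)
search_groups.fit(X, y, groups=groups)
```

Both options evaluate the same splits here; the first is handy when the folds come from somewhere else, such as a hand-written assignment of groups to K1, K2, K3.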