sklearn: user-defined cross-validation for time series data

Dem*_*nov 10 python scikit-learn cross-validation

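I'm working on a machine learning problem. I have a particular dataset with a time-series component. For this problem I'm using the well-known Python library sklearn. The library has many cross-validation iterators, and there are also several iterators for defining your own cross-validation. The problem is that I don't really know how to define a simple cross-validation for time series. Here is a good example of what I'm trying to get:

Suppose we have several periods (years) and we want to split our dataset into chunks like this: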

data = [1, 2, 3, 4, 5, 6, 7]

train: [1]                test: [2] (or test: [2, 3, 4, 5, 6, 7])
train: [1, 2]             test: [3] (or test: [3, 4, 5, 6, 7])
train: [1, 2, 3]          test: [4] (or test: [4, 5, 6, 7])
...
train: [1, 2, 3, 4, 5, 6] test: [7]

I can't really figure out how to create this kind of cross-validation with the sklearn tools. Maybe I should use PredefinedSplit from sklearn.cross_validation, like this:

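# sklearn.cross_validation was removed in scikit-learn 0.20; PredefinedSplit now lives in sklearn.model_selection.
from sklearn.model_selection import PredefinedSplit

train_fraction  = 0.8
train_size      = int(train_fraction * X_train.shape[0])
validation_size = X_train.shape[0] - train_size

# -1 keeps a sample in the training set; 1 assigns it to the single test fold.
cv_split = PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)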

Result:

train: [1, 2, 3, 4, 5] test: [6, 7]

But this is still not as good as the data split described above.

Dan*_*ață 6

You can build the desired cross-validation splits yourself, without using any of sklearn's splitters. Here is an example:

import numpy as np

from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

# Generate some data.
N = 10
X_train = np.random.randn(N, 3)
y_train = np.random.randn(N)

# Define the splits.
idxs = np.arange(N)
cv_splits = [(idxs[:i], idxs[i:]) for i in range(1, N)]

# Create the RFE object and compute a cross-validated score.
svr = SVR(kernel="linear")
rfecv = RFECV(estimator=svr, step=1, cv=cv_splits)
rfecv.fit(X_train, y_train)
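The question shows two test-set variants (only the next period, or all remaining periods). As a minimal sketch, the split logic can be wrapped in a small generator that covers both. The function name `expanding_window_splits` and its `single_step` flag are hypothetical, introduced here for illustration only; they are not part of sklearn.

```python
def expanding_window_splits(n_samples, single_step=False):
    """Yield (train_indices, test_indices) pairs for an expanding training window.

    Hypothetical helper for illustration; not an sklearn API.
    """
    for i in range(1, n_samples):
        # The training window always starts at 0 and grows by one sample per split.
        train = list(range(i))
        # Test on the next sample only, or on everything after the training window.
        test = [i] if single_step else list(range(i, n_samples))
        yield train, test


data = [1, 2, 3, 4, 5, 6, 7]
splits = list(expanding_window_splits(len(data), single_step=True))
for train, test in splits:
    print("train:", [data[j] for j in train], "test:", [data[j] for j in test])
```

The resulting list of index pairs can be passed as the `cv` argument in the same way as `cv_splits` above; sklearn documents these as arrays of indices, so converting each list with `np.asarray` may be the safer choice. Recent scikit-learn versions also ship `sklearn.model_selection.TimeSeriesSplit`, which should give the single-step variant when `n_splits` is set to the number of samples minus one.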