Scikit-learn train_test_split带索引

Cen*_*tAu 48 python classification scipy scikit-learn

使用train_test_split()时如何获取数据的原始索引?

我所拥有的是以下内容

from sklearn.cross_validation import train_test_split
import numpy as np
data = np.reshape(np.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, size=0.2)
Run Code Online (Sandbox Code Playgroud)

但这并没有给出原始数据的索引.一种解决方法是将索引添加到数据(例如data = [(i, d) for i, d in enumerate(data)]),然后将其传递到内部train_test_split,然后再次展开.有没有更清洁的解决方案?

ogr*_*sel 75

您可以像Julien所说的那样使用pandas数据帧或系列,但如果您想将自己限制为numpy,则可以传递额外的索引数组:

from sklearn.model_selection import train_test_split
import numpy as np
n_samples, n_features, n_classes = 10, 2, 2
data = np.random.randn(n_samples, n_features)  # 10 training examples
labels = np.random.randint(n_classes, size=n_samples)  # 10 labels
indices = np.arange(n_samples)
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2)
Run Code Online (Sandbox Code Playgroud)

  • 实际上这应该是公认的响应,因为它不使用任何额外的包而是sklearn.它可以更好地控制大熊猫的情况. (8认同)
  • `train_test_split`现在在`sklearn.model_selection`中 (2认同)

Jul*_*rec 35

Scikit学习与熊猫玩得很好,所以我建议你使用它.这是一个例子:

In [1]: 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels

In [2]: 
X = pd.DataFrame(data)
y = pd.Series(labels)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=test_size, 
                                                    random_state=0)

In [4]: X_test
Out[4]:

     0       1
2   -1.39   -1.86
8    0.48   -0.81
4   -0.10   -1.83

In [5]: y_test
Out[5]:

2    1
8    1
4    1
dtype: int32
Run Code Online (Sandbox Code Playgroud)

您可以直接调用DataFrame/Series上的任何scikit函数,它将起作用.

假设您想要进行LogisticRegression,这里是如何以一种很好的方式检索系数:

In [6]: 
from sklearn.linear_model import LogisticRegression

model = linear_model.LogisticRegression()
model = model.fit(X_train, y_train)

# Retrieve coefficients: index is the feature name ([0,1] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
    Coefficient
0   0.076987
1   -0.352463
Run Code Online (Sandbox Code Playgroud)

  • 如果我没有先创建索引就已经拆分了数据怎么办? (4认同)
  • 我编辑了我的答案,以展示如何使用来自pandas数据帧的功能名称检索系数.未来可能会节省一点时间. (2认同)

m3h*_*h0w 6

这是最简单的解决方案(Jibwa 在另一个答案中使它看起来很复杂),而不必自己生成索引 - 只需使用 ShuffleSplit 对象来生成 1 个拆分。

import numpy as np 
from sklearn.model_selection import ShuffleSplit # or StratifiedShuffleSplit
sss = ShuffleSplit(n_splits=1, test_size=0.1)

data_size = 100
X = np.reshape(np.random.rand(data_size*2),(data_size,2))
y = np.random.randint(2, size=data_size)

sss.get_n_splits(X, y)
train_index, test_index = next(sss.split(X, y)) 

X_train, X_test = X[train_index], X[test_index] 
y_train, y_test = y[train_index], y[test_index]
Run Code Online (Sandbox Code Playgroud)