Jas*_*son 23 python pandas scikit-learn
I'm trying to use scikit-learn's StratifiedShuffleSplit to split a sample dataset. I followed the example shown in the scikit-learn documentation:
import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)
# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)
for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]
# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()
However, when I run this script, I get the following error:
IndexError: indices are out-of-bounds
Can anyone point out what I'm doing wrong here? Thanks!
Mar*_*son 46
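You're running into the different indexing conventions of pandas DataFrame and NumPy ndarray. The arrays train_index and test_index are collections of row indices. But data is a pandas DataFrame object, and when you use a single index into that object, as in data[train_index], pandas expects train_index to contain column labels rather than row indices. You can either convert the DataFrame to a NumPy array using .values: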
data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]
Or use the pandas .iloc accessor:
for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]
I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
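For readers on a more recent scikit-learn, where the splitter lives in sklearn.model_selection, takes n_splits rather than n_iter, and yields indices from .split(X, y), the .iloc approach might look roughly like the sketch below. The toy DataFrame and the label_lookup_failed flag are made up for illustration and are not part of the original question. Using .iloc on target as well is my own choice, so that the selection stays positional even if the Series index is not 0..n-1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical stand-in for the wine data: 20 rows, two balanced classes.
toy = pd.DataFrame({
    "alcohol": np.linspace(9.0, 14.0, 20),
    "pH": np.linspace(3.0, 3.5, 20),
    "quality": [0, 1] * 10,
})
target = toy["quality"]
data = toy.drop("quality", axis=1)

# Newer API: configure the splitter first, then pass the data to split().
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

splits = []
for train_index, test_index in sss.split(data, target):
    # data[train_index] treats the integers as column labels, so it fails
    # (KeyError on recent pandas, IndexError on the version in the question).
    try:
        data[train_index]
        label_lookup_failed = False
    except (KeyError, IndexError):
        label_lookup_failed = True

    # .iloc selects rows by position and keeps the column labels.
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]
    splits.append((xtrain, xtest, ytrain, ytest))

# Class distribution of the last test fold.
print(ytest.value_counts())
```

Each element of splits holds DataFrames for the features and Series for the target, so column names survive the split.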