Train/test split without using scikit-learn

COD*_*DIY 3 python numpy scipy scikit-learn

I have a house price prediction dataset. I have to split the dataset into train and test.
I would like to know whether it is possible to do this using numpy or scipy.
I cannot use scikit-learn at the moment.

Ant*_*jnc 6

I know your question was only about doing a train_test_split with numpy or scipy, but there is actually a very simple way to do it with Pandas:

import pandas as pd 
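# df is assumed to be the DataFrame holding your full dataset (features and target together)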

# Shuffle your dataset 
shuffle_df = df.sample(frac=1)

# Define a size for your train set 
train_size = int(0.7 * len(df))

# Split your dataset 
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]

For those who want a quick and easy solution.
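If you also need separate feature and target objects, you can carve them out of the two splits by column; a minimal sketch building on the code above, assuming the target column is called prices (a hypothetical name, adjust it to your dataset):

# Separate features (X) and target (y) for each split;
# "prices" is an assumed target column name.
X_train = train_set.drop(columns=["prices"])
y_train = train_set["prices"]

X_test = test_set.drop(columns=["prices"])
y_test = test_set["prices"]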


Viv*_*hta 6

Although this is an old question, this answer might help.

This is roughly how sklearn implements train_test_split; the method given below takes arguments similar to sklearn's.

import numpy as np
from itertools import chain

def _indexing(x, indices):
    """
    :param x: array from which the given indices are to be fetched
    :param indices: indices to fetch
    :return: sub-array built from the given array and indices
    """
    # np array indexing
    if hasattr(x, 'shape'):
        return x[indices]

    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: fraction of samples in the test set, in the range (0, 1)
    :param shuffle: whether to shuffle the arrays before splitting
    :param random_seed: random seed value
    :return: 2 * len(arrays) arrays, split into train and test
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length*test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))

Of course, sklearn's implementation also supports stratified k-fold splitting, splitting pandas Series, and so on. This one only handles splitting lists and numpy arrays, which I think will work for your case.
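A minimal usage sketch of the function above, on made-up toy data (the values are purely illustrative):

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 1, 0, 1, 0]

# Results come back in the same order as the inputs: a train/test pair per array
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_seed=42)

print(len(X_train), len(X_test))  # 3 2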


Jen*_*sen 1

import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"], 
            axis=1, inplace=True) # important to drop prices as well

# create random train/test split
indices = np.arange(X_data.shape[0])  # use an array: a range object cannot be shuffled in place
num_training_instances = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_instances]
test_indices = indices[num_training_instances:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]

This assumes you want a random split. What happens is that we create a list of indices as long as the number of data points you have, i.e. the length of the first axis of X_data (or Y_data). We then put them in random order and take only the first 80% of those shuffled indices as training data, leaving the rest for testing. [:num_training_instances] simply selects the first num_training_instances entries from the list. After that, you just pull the rows out of your data using the two index lists, and the data is split. Remember to drop the prices from X_data, and to set a seed (np.random.seed(some_integer) at the beginning) if you want the split to be reproducible.
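If reproducibility matters, here is a minimal sketch of the seeded variant described above, reusing X_data from the code block (the seed value 42 is just an example):

np.random.seed(42)  # fix the RNG so the shuffle, and hence the split, is repeatable

indices = np.arange(X_data.shape[0])
np.random.shuffle(indices)

num_training_instances = int(0.8 * X_data.shape[0])
train_indices = indices[:num_training_instances]
test_indices = indices[num_training_instances:]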