如何在Python中将数据表数据帧拆分为训练和测试数据集

Question

如何在Python中将数据表数据帧拆分为训练和测试数据集

ibr*_*bra 5 python dataframe pandas train-test-split

我正在使用数据表数据框。如何将数据帧拆分为训练数据集和测试数据集？
\n与 pandas dataframe 类似，我尝试使用train_test_split(dt_df,classes)sklearn.model_selection，但它不起作用并且出现错误。

\n

import datatable as dt\nimport numpy as np\nfrom sklearn.model_selection import train_test_split\n\ndt_df = dt.fread(csv_file_path)\nclasse = dt_df[:, "classe"])\ndel dt_df[:, "classe"])\n\nX_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)\n

Run Code Online (Sandbox Code Playgroud)\n

我收到以下错误： TypeError: 列选择器必须是整数或字符串，而不是 <class \'numpy.ndarray\'>

\n

我尝试通过将数据帧转换为 numpy 数组来解决方法：

\n

classe = np.ravel(dt_df[:, "classe"])\ndt_df = dt_df.to_numpy()\n

Run Code Online (Sandbox Code Playgroud)\n

就像这样它可以工作，但是，我不知道是否有一种方法可以train_test_split像 pandas 数据帧一样正常工作。

\n

编辑1： csv文件包含列字符串，并且值是无符号整数。使用print(dt_df)我们得到：

\n

\n | CCC CCG CCU CCA CGC CGG CGU CGA CUC CUG \xe2\x80\xa6 \n---- + --- --- --- --- --- --- --- --- --- --- \n 0 | 0 0 0 0 2 0 1 0 0 1 \xe2\x80\xa6 \n 1 | 0 0 0 0 1 0 2 1 0 1 \xe2\x80\xa6 \n 2 | 0 0 0 1 1 0 1 0 1 2 \xe2\x80\xa6 \n 3 | 0 0 0 1 1 0 1 0 1 2 \xe2\x80\xa6 \n 4 | 0 0 0 1 1 0 1 0 1 2 \xe2\x80\xa6 \n 5 | 0 0 0 1 1 0 1 0 1 2 \xe2\x80\xa6 \n 6 | 0 0 0 1 0 0 3 0 0 2 \xe2\x80\xa6 \n 7 | 0 0 0 1 1 0 0 0 1 2 \xe2\x80\xa6 \n 8 | 0 0 0 1 1 0 1 0 1 2 \xe2\x80\xa6 \n 9 | 0 0 1 0 1 0 1 0 1 3 \xe2\x80\xa6 \n 10 | 0 0 1 0 1 0 1 0 1 3 \xe2\x80\xa6 \n ...\n

\n

谢谢你的帮助。

\n

Answer 1

小智 8

这是我仅使用 pandas 制作的一个简单函数。样本函数随机且均匀地选择数据框中的行（轴=0）作为测试集。可以通过删除原始数据框中与测试集具有相同索引的行来选择训练集的行。

def train_test_split(df, frac=0.2):
    
    # get random sample 
    test = df.sample(frac=frac, axis=0)

    # get everything but the test sample
    train = df.drop(index=test.index)

    return train, test

Run Code Online (Sandbox Code Playgroud)

Answer 2

ibr*_*bra 0

我使用 sklearn.model_selection 将数据表数据帧拆分为 python 中的训练和测试数据集的解决方案 train_test_split(dt_df,classes)是将数据表数据帧转换为 numpy，如我在问题帖子中提到的，或转换为 pandas 数据帧，如 @Manoor Hassan 评论的那样（往返）再次）：

split方法之前的源代码：

import datatable as dt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier

dt_df = dt.fread(csv_file_path)

classe = np.ravel(dt_df[:, "classe"])
del dt_df[:, "classe"])

Run Code Online (Sandbox Code Playgroud)

split方法后的源码：

ExTrCl = ExtraTreesClassifier()
ExTrCl.fit(X_train, y_train)
pred_test = ExTrCl.predict(X_test)

Run Code Online (Sandbox Code Playgroud)

方法1：转换为numpy

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

Run Code Online (Sandbox Code Playgroud)

方法2：转换为numpy并在分割后返回数据表数据帧：

# source code before split method

dt_df = dt_df.to_numpy()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

X_train = dt.Frame(X_train)

# source code after split method

Run Code Online (Sandbox Code Playgroud)

方法3：转换为pandas数据框

# source code before split method

dt_df = dt_df.to_pandas()

X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test_size)

# source code after split method

Run Code Online (Sandbox Code Playgroud)

这 3 种方法工作正常，但训练 (ExTrCl.fit) 和预测 (ExTrCl.predict) 的时间性能存在差异，对于大约500 Mo的 csv 文件，我得到以下结果：

                       T 转换 T.train T.pred
M1 to_numpy 3 85 0.5
M2 到_numpy 并返回 3.5 29 0.5
M3 转 熊猫 4 37 4

归档时间：	5 年，4 月前
查看次数：	16151 次
最近记录：	3 年，10 月前