Handling imbalanced data with sklearn's train_test_split

Question

Mar*_*yam 9 training-data python-3.x scikit-learn oversampling imbalanced-data

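I have a very imbalanced dataset. I used the sklearn train_test_split function to extract the training set. Now I want to oversample the training set, so I counted the number of type1 samples (my dataset has two classes, type1 and type2), but almost all of the training data turned out to be type1, so I cannot oversample.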

Previously I split the data into training and test sets with code I wrote myself. That code put 0.8 of all type1 data and 0.8 of all type2 data into the training set.

How can I get the same behavior from train_test_split or another splitting method in sklearn?

*I can only use sklearn or methods I write myself.

Answer 1

Arn*_*aud 18

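You are looking for stratification. Why? A stratified split keeps the class proportions roughly the same in the training and test sets, so the rare class ends up in both.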

train_test_split has a stratify parameter to which you can pass the list of labels, for example:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.2)
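
To see the effect on class counts, here is a minimal sketch on made-up data. The 95/5 class ratio, the array shapes, and the random_state value are assumptions chosen for illustration, not taken from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 95 samples of class 0 ("type1"), 5 of class 1 ("type2").
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)

# stratify=y asks for the same class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Count how many samples of each class landed in each split.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)
```

With these numbers the training set receives 0.8 of each class (76 and 4 samples), which matches the hand-written split described in the question.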

There is also StratifiedShuffleSplit.
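
StratifiedShuffleSplit is useful when you want several independent stratified splits rather than one. A rough sketch, again on assumed toy data (the 95/5 ratio, n_splits=3, and random_state are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 95 samples of class 0 ("type1"), 5 of class 1 ("type2").
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)

# Three random splits, each holding out 20% while preserving class proportions.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_counts = []
for train_idx, test_idx in sss.split(X, y):
    # split() yields index arrays; use them to select rows of X and y.
    y_train, y_test = y[train_idx], y[test_idx]
    split_counts.append(
        (np.bincount(y_train).tolist(), np.bincount(y_test).tolist())
    )

print(split_counts)
```

Each iteration yields index arrays, so the same indices can be applied to X to obtain the matching feature rows.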
