ami*_*jad 13 python machine-learning dataset scikit-learn
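I have a dataset of roughly 2 million observations that I need to split into training, validation and test sets in a 60:20:20 ratio. A simplified excerpt of my dataset looks like this: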
+---------+------------+-----------+-----------+
| note_id | subject_id | category | note |
+---------+------------+-----------+-----------+
| 1 | 1 | ECG | blah ... |
| 2 | 1 | Discharge | blah ... |
| 3 | 1 | Nursing | blah ... |
| 4 | 2 | Nursing | blah ... |
| 5 | 2 | Nursing | blah ... |
| 6 | 3 | ECG | blah ... |
+---------+------------+-----------+-----------+
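There are multiple categories, and they are not evenly balanced, so I need to ensure that the training, validation and test sets all have the same category proportions as the original dataset. This part is fine: I can use StratifiedShuffleSplit from the sklearn library.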
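For reference, a minimal sketch of the stratified-only 60:20:20 split described above, using two passes of StratifiedShuffleSplit. The toy DataFrame and its category counts are made up for illustration; note that this version does nothing to keep a subject's notes together.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the notes table (hypothetical data, not from the question).
df = pd.DataFrame({
    "note_id": np.arange(100),
    "category": ["ECG"] * 20 + ["Discharge"] * 30 + ["Nursing"] * 50,
})

# Pass 1: carve off 20% as the test set, stratified by category.
sss_test = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
rest_idx, test_idx = next(sss_test.split(df, df["category"]))
df_rest, df_test = df.iloc[rest_idx], df.iloc[test_idx]

# Pass 2: take 25% of the remaining 80% (20% overall) as validation.
sss_val = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(sss_val.split(df_rest, df_rest["category"]))
df_train, df_val = df_rest.iloc[train_idx], df_rest.iloc[val_idx]

# Category shares per split should be close to the overall 20/30/50 mix.
for name, part in [("train", df_train), ("val", df_val), ("test", df_test)]:
    print(name, len(part), part["category"].value_counts(normalize=True).to_dict())
```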
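However, I also need to ensure that the observations from each subject are not spread across the training, validation and test datasets. All observations from a given subject need to be in the same bucket, so that my trained model has never seen the subject before validation/testing. For example, every observation of subject_id 1 should be in the training set.
I cannot think of a way to do a split that is stratified by category, prevents contamination (for want of a better word) of subject_id across datasets, keeps the 60:20:20 ratio, and still shuffles the dataset in some way. Any help would be appreciated!
Thanks!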
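EDIT:
I have now learnt that grouping by a column and keeping groups together across dataset splits can also be done in sklearn, through the GroupShuffleSplit class. So essentially, what I need is a combined stratified and grouped shuffle split, i.e. a StratifiedGroupShuffleSplit, which does not exist. GitHub issue: https://github.com/scikit-learn/scikit-learn/issues/12076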
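To illustrate the grouped half of the problem, here is a small sketch of GroupShuffleSplit on made-up data (50 hypothetical subjects with 4 notes each). It keeps every subject on one side of the split but makes no attempt to stratify by category.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 50 subjects, 4 notes each.
subject_id = np.repeat(np.arange(50), 4)
X = np.zeros((len(subject_id), 1))  # placeholder features

# test_size here is a fraction of the groups (subjects), not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=subject_id))

train_subjects = set(subject_id[train_idx])
test_subjects = set(subject_id[test_idx])
print(len(train_subjects), len(test_subjects))
print(train_subjects & test_subjects)  # no subject appears on both sides
```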
This was solved in scikit-learn 1.0 with StratifiedGroupKFold.
The example below generates 3 folds after shuffling, keeping groups together and stratifying as far as possible:
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

X = np.ones((30, 2))
y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0,
              0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
              1, 1, 1, 0, 0, 0, 0, 1, 1, 1,])
groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5,
                   5, 5, 6, 6, 7, 8, 8, 9, 9, 9,
                   10, 11, 11, 12, 12, 12, 13, 13,
                   13, 13])
print("ORIGINAL POSITIVE RATIO:", y.mean())
cv = StratifiedGroupKFold(n_splits=3, shuffle=True)
for fold, (train_idxs, test_idxs) in enumerate(cv.split(X, y, groups)):
    print("Fold :", fold)
    print("TRAIN POSITIVE RATIO:", y[train_idxs].mean())
    print("TEST POSITIVE RATIO :", y[test_idxs].mean())
    print("TRAIN GROUPS :", set(groups[train_idxs]))
    print("TEST GROUPS :", set(groups[test_idxs]))
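In the output you can see that the ratio of positive cases in each fold stays close to the original positive ratio, and that the same group never appears in both sets. Of course, the fewer and larger your groups are (and the more imbalanced your classes), the harder it is to stay close to the original class distribution.
Output: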
ORIGINAL POSITIVE RATIO: 0.5
Fold : 0
TRAIN POSITIVE RATIO: 0.4375
TEST POSITIVE RATIO : 0.5714285714285714
TRAIN GROUPS : {1, 3, 4, 5, 6, 7, 10, 11}
TEST GROUPS : {2, 8, 9, 12, 13}
Fold : 1
TRAIN POSITIVE RATIO: 0.5
TEST POSITIVE RATIO : 0.5
TRAIN GROUPS : {2, 4, 5, 7, 8, 9, 11, 12, 13}
TEST GROUPS : {1, 10, 3, 6}
Fold : 2
TRAIN POSITIVE RATIO: 0.5454545454545454
TEST POSITIVE RATIO : 0.375
TRAIN GROUPS : {1, 2, 3, 6, 8, 9, 10, 12, 13}
TEST GROUPS : {11, 4, 5, 7}
小智 5
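This question is more than a year old, but I found myself in a similar situation: I have labels and groups, and because of the nature of the groups, all data points of one group must be either only in test or only in train. I wrote a small algorithm using pandas and sklearn; I hope it helps.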
from sklearn.model_selection import GroupShuffleSplit

# Assumes `df` is a DataFrame with 'label' and 'groups' columns and that
# `valid_size` (the test fraction) is defined beforehand.
groups = df.groupby('label')
all_train = []
all_test = []
for group_id, group in groups:
    # if a group is already taken in test or train it must stay there
    group = group[~group['groups'].isin(all_train+all_test)]
    # if group is empty
    if group.shape[0] == 0:
        continue
    train_inds, test_inds = next(GroupShuffleSplit(
        test_size=valid_size, n_splits=2, random_state=7).split(group, groups=group['groups']))
    all_train += group.iloc[train_inds]['groups'].tolist()
    all_test += group.iloc[test_inds]['groups'].tolist()

train = df[df['groups'].isin(all_train)]
test = df[df['groups'].isin(all_test)]
form_train = set(train['groups'].tolist())
form_test = set(test['groups'].tolist())
inter = form_train.intersection(form_test)
print(df.groupby('label').count())
print(train.groupby('label').count())
print(test.groupby('label').count())
print(inter) # this should be empty
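Essentially, I need a StratifiedGroupShuffleSplit, which does not exist (see the GitHub issue linked in the question). This is because the behaviour of such a function is unclear, and producing a dataset that is both grouped and stratified is not always possible (this has been discussed elsewhere as well), especially with a heavily imbalanced dataset such as mine. In my case, I want grouping to be strict, so that there is no overlap of groups whatsoever, while the stratification and the 60:20:20 split ratio only need to hold approximately, i.e. as well as possible.
As Ghanem mentions, I had no choice but to build a function to split the dataset myself, which I have done below: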
import numpy as np
import pandas as pd

def StratifiedGroupShuffleSplit(df_main):
    df_main = df_main.reindex(np.random.permutation(df_main.index)) # shuffle dataset
    # create empty train, val and test datasets
    df_train = pd.DataFrame()
    df_val = pd.DataFrame()
    df_test = pd.DataFrame()
    hparam_mse_wgt = 0.1 # must be between 0 and 1
    assert(0 <= hparam_mse_wgt <= 1)
    train_proportion = 0.6 # must be between 0 and 1
    assert(0 <= train_proportion <= 1)
    val_test_proportion = (1-train_proportion)/2
    subject_grouped_df_main = df_main.groupby(['subject_id'], sort=False, as_index=False)
    # percentage share of each category in the full dataset
    category_grouped_df_main = df_main.groupby('category').count()[['subject_id']]/len(df_main)*100

    def calc_mse_loss(df):
        # mean squared difference between the category shares of df and of the full dataset
        grouped_df = df.groupby('category').count()[['subject_id']]/len(df)*100
        df_temp = category_grouped_df_main.join(grouped_df, on = 'category', how = 'left', lsuffix = '_main')
        df_temp.fillna(0, inplace=True)
        df_temp['diff'] = (df_temp['subject_id_main'] - df_temp['subject_id'])**2
        mse_loss = np.mean(df_temp['diff'])
        return mse_loss

    def add_group(df, group):
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        return pd.concat([df, pd.DataFrame(group)], ignore_index=True)

    i = 0
    for _, group in subject_grouped_df_main:
        # seed each of the three datasets with one subject
        if (i < 3):
            if (i == 0):
                df_train = add_group(df_train, group)
                i += 1
                continue
            elif (i == 1):
                df_val = add_group(df_val, group)
                i += 1
                continue
            else:
                df_test = add_group(df_test, group)
                i += 1
                continue
        # improvement in category balance if the group were added to each dataset
        mse_loss_diff_train = calc_mse_loss(df_train) - calc_mse_loss(add_group(df_train, group))
        mse_loss_diff_val = calc_mse_loss(df_val) - calc_mse_loss(add_group(df_val, group))
        mse_loss_diff_test = calc_mse_loss(df_test) - calc_mse_loss(add_group(df_test, group))
        # signed squared gap between target and current share of records
        total_records = len(df_train) + len(df_val) + len(df_test)
        len_diff_train = (train_proportion - (len(df_train)/total_records))
        len_diff_val = (val_test_proportion - (len(df_val)/total_records))
        len_diff_test = (val_test_proportion - (len(df_test)/total_records))
        len_loss_diff_train = len_diff_train * abs(len_diff_train)
        len_loss_diff_val = len_diff_val * abs(len_diff_val)
        len_loss_diff_test = len_diff_test * abs(len_diff_test)
        loss_train = (hparam_mse_wgt * mse_loss_diff_train) + ((1-hparam_mse_wgt) * len_loss_diff_train)
        loss_val = (hparam_mse_wgt * mse_loss_diff_val) + ((1-hparam_mse_wgt) * len_loss_diff_val)
        loss_test = (hparam_mse_wgt * mse_loss_diff_test) + ((1-hparam_mse_wgt) * len_loss_diff_test)
        # assign the group to the dataset with the highest score
        if (max(loss_train,loss_val,loss_test) == loss_train):
            df_train = add_group(df_train, group)
        elif (max(loss_train,loss_val,loss_test) == loss_val):
            df_val = add_group(df_val, group)
        else:
            df_test = add_group(df_test, group)
        print ("Group " + str(i) + ". loss_train: " + str(loss_train) + " | " + "loss_val: " + str(loss_val) + " | " + "loss_test: " + str(loss_test) + " | ")
        i += 1
    return df_train, df_val, df_test

df_train, df_val, df_test = StratifiedGroupShuffleSplit(df_main)
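I created a somewhat arbitrary loss function based on two things. As the code shows, these are the mean squared difference between each category's percentage share in a split and its share in the overall dataset (calc_mse_loss), and the signed squared difference between a split's target proportion (60:20:20) and its current proportion of records.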
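The weighting of these two inputs to the loss function is controlled by the static hyperparameter hparam_mse_wgt. For my particular dataset a value of 0.1 worked well, but I would encourage you to experiment with it if you use this function. Setting it to 0 prioritises only the split ratio and ignores stratification. Setting it to 1 does the opposite.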
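As a small illustration of how the weight blends the two terms, the helper below mirrors the weighted sum used in the function above. The name assignment_score and the example numbers are made up for this sketch; they are not part of the original code.

```python
def assignment_score(mse_gain, len_gap, mse_wgt=0.1):
    """Score for adding a group to one split (hypothetical helper).

    mse_gain: reduction in the category-share MSE if the group is added.
    len_gap:  target proportion minus current proportion of the split.
    """
    len_term = len_gap * abs(len_gap)  # signed square, keeps the direction of the gap
    return mse_wgt * mse_gain + (1 - mse_wgt) * len_term

# A split that is 10 points under its target size, where adding the group
# would improve the category balance by 2.0:
print(assignment_score(2.0, 0.1, mse_wgt=0.0))  # only the size term counts
print(assignment_score(2.0, 0.1, mse_wgt=1.0))  # only the stratification term counts
print(assignment_score(2.0, 0.1, mse_wgt=0.1))  # blend of both
```

Because the MSE term is computed on percentages while the size term is computed on fractions, the two are on different scales, which may be one reason a small weight such as 0.1 works in practice.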
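Using this loss function, I then iterate over each subject (group) and append it to whichever dataset (training, validation or test) has the highest loss value.
It is not particularly sophisticated, but it does the job for me. It will not necessarily work for every dataset, but the larger the dataset, the better the chance. Hopefully someone else will find it useful.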
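Whichever approach is used, it may be worth checking the result. The sketch below uses a hypothetical helper, check_split (not part of any library), which reports subjects that leak across splits and compares category shares per split with the full dataset. The toy data reuses the excerpt from the question.

```python
import pandas as pd

def check_split(df_main, splits, group_col="subject_id", label_col="category"):
    """Return (leaked_groups, report) for a dict of split name -> DataFrame."""
    seen = {}
    leaked = set()
    for name, part in splits.items():
        for g in part[group_col].unique():
            if g in seen and seen[g] != name:
                leaked.add(g)  # this group appears in more than one split
            seen[g] = name
    # Category shares in the full data and in each split, side by side.
    report = pd.DataFrame({"all": df_main[label_col].value_counts(normalize=True)})
    for name, part in splits.items():
        report[name] = part[label_col].value_counts(normalize=True)
    report = report.fillna(0.0)
    # Fraction of all rows that ended up in each split.
    report.loc["share_of_rows"] = [1.0] + [len(p) / len(df_main) for p in splits.values()]
    return leaked, report

df_main = pd.DataFrame({
    "note_id": [1, 2, 3, 4, 5, 6],
    "subject_id": [1, 1, 1, 2, 2, 3],
    "category": ["ECG", "Discharge", "Nursing", "Nursing", "Nursing", "ECG"],
})
splits = {
    "train": df_main[df_main["subject_id"] == 1],
    "val": df_main[df_main["subject_id"] == 2],
    "test": df_main[df_main["subject_id"] == 3],
}
leaked, report = check_split(df_main, splits)
print(leaked)
print(report)
```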