user0 — python, numpy, scikit-learn
In the GroupKFold source, random_state is set to None:
def __init__(self, n_splits=3):
    super(GroupKFold, self).__init__(n_splits, shuffle=False,
                                     random_state=None)
So, when run multiple times (code taken from here):
import numpy as np
from sklearn.model_selection import GroupKFold

for i in range(0, 10):
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
    y = np.array([1, 2, 3, 4])
    groups = np.array([0, 0, 2, 2])
    group_kfold = GroupKFold(n_splits=2)
    group_kfold.get_n_splits(X, y, groups)
    print(group_kfold)
    for train_index, test_index in group_kfold.split(X, y, groups):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        print(X_train, X_test, y_train, y_test)
    print  # Python 2: a bare print emits a blank line
    print
Output:
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
GroupKFold(n_splits=2)
('TRAIN:', array([0, 1]), 'TEST:', array([2, 3]))
(array([[1, 2],
[3, 4]]), array([[5, 6],
[7, 8]]), array([1, 2]), array([3, 4]))
('TRAIN:', array([2, 3]), 'TEST:', array([0, 1]))
(array([[5, 6],
[7, 8]]), array([[1, 2],
[3, 4]]), array([3, 4]), array([1, 2]))
...and so on. The splits are identical.

How do I set random_state for GroupKFold so as to get a different (but reproducible) set of splits across several different trials of cross-validation?
I want something like:
GroupKFold(n_splits=2, random_state=42)
('TRAIN:', array([0, 1]),
'TEST:', array([2, 3]))
('TRAIN:', array([2, 3]),
'TEST:', array([0, 1]))
GroupKFold(n_splits=2, random_state=13)
('TRAIN:', array([0, 2]),
'TEST:', array([1, 3]))
('TRAIN:', array([1, 3]),
'TEST:', array([0, 2]))
So far, one possible strategy seems to be to use sklearn.utils.shuffle first, following this suggestion. However, this merely rearranges the elements within each fold; it does not give us new splits:
from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold
import numpy as np
import sys
import pdb

random_state = int(sys.argv[1])

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def cv(X, y, groups, random_state):
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    cv_out = GroupKFold(n_splits=2)
    cv_out_splits = cv_out.split(X_s, y_s, groups_s)
    for train, test in cv_out_splits:
        print "---"
        print X_s[test]
        print y_s[test]
        print "test groups", groups_s[test]
        print "train groups", groups_s[train]

pdb.set_trace()
print "***"
cv(X, y, groups, random_state)
Output:
>python sshuf.py 32
***
---
[[ 2 3]
[ 4 5]
[ 0 1]
[ 8 9]
[12 13]]
[1 2 0 4 6]
test groups [0 0 0 2 4]
train groups [7 6 1 3 5]
---
[[18 19]
[16 17]
[ 6 7]
[10 11]
[14 15]]
[9 8 3 5 7]
test groups [7 6 1 3 5]
train groups [0 0 0 2 4]
>python sshuf.py 234
***
---
[[12 13]
[ 4 5]
[ 0 1]
[ 2 3]
[ 8 9]]
[6 2 0 1 4]
test groups [4 0 0 0 2]
train groups [7 3 1 5 6]
---
[[18 19]
[10 11]
[ 6 7]
[14 15]
[16 17]]
[9 5 3 7 8]
test groups [7 3 1 5 6]
train groups [4 0 0 0 2]
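In fact, the partition GroupKFold chooses depends only on the group labels and their per-group sample counts, never on row order, so shuffling the rows can never produce new splits. A quick check of this (a sketch added here for illustration; the test_group_sets helper is not part of the original code):

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

def test_group_sets(random_state):
    # Which groups end up in each test fold after shuffling the rows?
    X_s, y_s, groups_s = shuffle(X, y, groups, random_state=random_state)
    return {frozenset(groups_s[test])
            for _, test in GroupKFold(n_splits=2).split(X_s, y_s, groups_s)}

# The same partition of groups comes back for any seed:
print(test_group_sets(32) == test_group_sets(234))  # True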
KFold is only randomized with shuffle=True. Some datasets should not be shuffled. GroupKFold is not randomized at all; hence random_state=None. GroupShuffleSplit may be closer to what you're looking for. A comparison of the group-based splitters:
- In GroupKFold, the test sets form a complete partition of all the data.
- LeavePGroupsOut leaves out all possible subsets of P groups, combinatorially; for P > 1, the test sets will overlap. Since this means "n_groups choose P" splits altogether, you generally want a small P, and most often want LeaveOneGroupOut, which is essentially the same as GroupKFold with n_splits equal to the number of groups.
- GroupShuffleSplit makes no statement about the relationship between successive test sets; each train/test split is performed independently.

As an aside, Dmytro Lituiev has proposed an alternative GroupShuffleSplit algorithm that is better at getting the right number of samples (not merely the right number of groups) in the test set for a specified test_size.
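For instance, a minimal sketch (added here, not from the original answer) of what GroupShuffleSplit provides: different, seed-reproducible group splits. Unlike GroupKFold, the test sets of successive splits may overlap:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])

for seed in (42, 13):
    # test_size=0.5 puts roughly half the *groups* into each test set
    gss = GroupShuffleSplit(n_splits=2, test_size=0.5, random_state=seed)
    print("random_state=%d" % seed)
    for train, test in gss.split(X, y, groups):
        print("  TRAIN groups:", np.unique(groups[train]),
              "TEST groups:", np.unique(groups[test]))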
Inspired by user0's answer (I can't comment), but faster:
import numpy as np
import pandas as pd

def RandomGroupKFold_split(groups, n, seed=None):  # noqa: N802
    """
    Random analogue of sklearn.model_selection.GroupKFold.split.

    :return: list of (train, test) indices
    """
    groups = pd.Series(groups)
    ix = np.arange(len(groups))
    unique = np.unique(groups)
    np.random.RandomState(seed).shuffle(unique)
    result = []
    for split in np.array_split(unique, n):
        mask = groups.isin(split)
        train, test = ix[~mask], ix[mask]
        result.append((train, test))
    return result
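A possible usage sketch (the example data below is illustrative, not from the answer): different seeds yield different, reproducible partitions of the groups into n folds.

groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]  # hypothetical example groups
for seed in (42, 13):
    print("seed =", seed)
    for train, test in RandomGroupKFold_split(groups, n=2, seed=seed):
        print("  TRAIN:", train, "TEST:", test)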