numpy中的分层抽样

sia*_*mii 6 python numpy

在numpy我有一个像这样的数据集.前两列是索引.我可以通过索引将我的数据集划分为块,即第一个块是0 0秒块是0 1个第三个块0 2然后是1个,1个,1个等等,依此类推.每个块至少有两个元素.索引列中的数字可以变化

我需要随机地沿着这些块分割数据集80%-20%,这样在分割后,两个数据集中的每个块都至少有1个元素.我怎么能这样做?

indices | real data
        |
0   0   | 43.25 665.32 ...  } 1st block
0   0   | 11.234            }
0   1     ...               } 2nd block
0   1                       } 
0   2                       } 3rd block
0   2                       }
1   0                       } 4th block
1   0                       }
1   0                       }
1   1                       ...
1   1                       
1   2
1   2
2   0
2   0 
2   1
2   1
2   1
...
Run Code Online (Sandbox Code Playgroud)

Jai*_*ime 5

看看你喜欢这个。为了引入随机性,我正在对整个数据集进行混洗。这是我想出如何进行矢量化分割的唯一方法。也许你可以简单地洗牌一个索引数组,但对我今天的大脑来说,这是一个太多的间接。我还使用了结构化数组,以便于提取块。首先,让我们创建一个示例数据集:

from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)

dataset = np.empty((items+5,), [('idx1', np.int), ('idx2', np.int),
                             ('data', np.float)])
dataset['idx1'][:2*c1*c2] =  np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3
Run Code Online (Sandbox Code Playgroud)

现在分层抽样:

# For randomness, shuffle the entire array
np.random.shuffle(dataset)

blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)

x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # seconf item goes to B

a_idx = threshold > np.random.rand(len(dataset))

A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
Run Code Online (Sandbox Code Playgroud)

运行后,分割率大约为 80/20,所有块都在两个数组中表示:

>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
Run Code Online (Sandbox Code Playgroud)