How can I quickly create random samples without replacement from a population?

Nic*_*s R 7 python random algorithm numpy python-3.x

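I have a problem where I need to create m samples of size n without replacement. In addition, each sample must preserve the original order of the population vector. And all of this has to be very fast.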

Population = [50, 30, 12, 24, 420, 243, 173, 194, 123, 43, 21, 64, 34...]

300 samples of a combination of 3 
[[24, 21, 34], [50, 194, 21], [12, 173, 64], [30, 173, 194].... [12, 243, 34]]

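The samples must be independent and, in my case, I need to preserve the order of the original population array. There are several possible answers, but none of them is very fast, and all of them are the bottleneck of my code. I use NumPy to generate the random numbers.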

Some of the most promising approaches are the following:

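  1. Using numpy.random.choice almost does the job, but only with replacement, which produces samples containing repeated numbers. It is very fast, but I then need a quick way to discard the bad samples.
    import numpy as np

    gen = np.random.default_rng()

    def random_combination(population, sample, number = 3):
        # Draw index rows with replacement; a row may contain repeated indices.
        with_replacement_samples = gen.choice(len(population), size=(sample, number))
        pairs = np.sort(with_replacement_samples)
        positions = population[pairs]  # population must be a NumPy array

        for i in positions:
            if i[0] == i[2] or i[0] == i[1] or i[1] == i[2]:
                continue  # I would need to generate a new sample each time; the if is expensive
            yield i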

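  2. Another approach is this method, which I found in another answer, but it is very slow.
def random_combination4(posiciones, sample, number = 3):
    # One row of random keys per sample; the `number` smallest keys select the indices.
    pair = np.argpartition(gen.random((sample, len(posiciones))), number - 1, axis=-1)[:, :number]
    pair = np.sort(pair)
    for i in posiciones[pair]:
        yield i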

  3. One last interesting approach is this person's method, but that solution dates from before NumPy fixed the performance problems of its random number generation.
def random_combination(population, sample, number = 3, probabilities = None):
    # Map this signature onto the names used in the body of the borrowed solution.
    num_elements = len(population)
    num_samples = sample
    sample_size = number
    # replicate probabilities as many times as `num_samples`
    if probabilities is None:
        replicated_probabilities = np.tile(np.full(shape=num_elements, fill_value=1/num_elements), (num_samples, 1))
    else:
        replicated_probabilities = np.tile(probabilities, (num_samples, 1))
    # get random shifting numbers & scale them correctly
    random_shifts = gen.random(replicated_probabilities.shape)
    random_shifts /= random_shifts.sum(axis=1)[:, np.newaxis]
    # shift by numbers & find largest (by finding the smallest of the negative)
    shifted_probabilities = random_shifts - replicated_probabilities
    combinations = np.sort(np.argpartition(shifted_probabilities, sample_size, axis=1)[:, :sample_size])
    for i in combinations:
        yield population[i]



Finally, the last approach is to use a plain for loop, but this is very expensive.

def random_combination2(population, sample, number = 3):
    for i in range(sample):
        pair = np.sort( gen.choice(len(population), size = number, replace = False))
        yield population[pair[0]], population[pair[1]], population[pair[2]]
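
The rejection idea from the first approach can be vectorized so that no Python-level if runs per sample. The following is a minimal sketch, not a benchmarked solution: it draws index rows with replacement, sorts each row, drops every row that contains a repeated index, and redraws only the missing rows. The function name sample_sorted_combinations and the oversampling amount (2 * missing + 8) are illustrative assumptions that do not come from the original post. It assumes the sample size is small compared with the population, since the rejection rate grows otherwise.

```python
import numpy as np

def sample_sorted_combinations(population, num_samples, size=3, rng=None):
    """Return `num_samples` independent samples without replacement,
    each keeping the original order of `population`."""
    rng = np.random.default_rng() if rng is None else rng
    population = np.asarray(population)
    if size > len(population):
        raise ValueError("size cannot exceed the population length")

    accepted = np.empty((0, size), dtype=np.intp)
    while len(accepted) < num_samples:
        missing = num_samples - len(accepted)
        # Oversample (assumed factor) so that one pass is usually enough.
        draws = rng.integers(len(population), size=(2 * missing + 8, size))
        draws = np.sort(draws, axis=1)
        # In a sorted row, a repeated index shows up as two equal neighbours.
        keep = (np.diff(draws, axis=1) > 0).all(axis=1)
        accepted = np.concatenate([accepted, draws[keep]])

    # Sorted indices mean the values appear in their original population order.
    return population[accepted[:num_samples]]
```

Calling sample_sorted_combinations(population, 300, 3) returns an array of shape (300, 3). Because duplicates are detected on the indices rather than on the values, a population that contains equal values is handled correctly.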

jot*_*tbe 1

I'm not sure whether this is faster than your version, but maybe you can give it a try:

from random import shuffle
import numpy as np
#import pandas as pd  # activate this line, if you want to use pandas

# Just create a fake-population with randn, 
# you can just skip this line and set population 
# to your data. Just convert it to a numpy array
# in case it is not stored in one. 
# In case it consists of columns with different types,
# you can also use a pandas approach, which only differs in three lines
population= np.random.randn(10000, 1)
#population= pd.DataFrame(population) # activate this line if you want to try out pandas
indexes_orig= list(range(population.shape[0]))
shuffle(indexes_orig)
indexes= indexes_orig

m= 20  # m samples
n= 200 # of size n each
samples= list()
for i in range(m):
    sample_indexes= indexes[:n]
    sample_indexes.sort()
    indexes= indexes[n:]
    samples.append(population[sample_indexes, :])
    #samples.append(population.iloc[sample_indexes, :]) # uncomment this instead of the line above, if you want to use pandas, it assumes you use a 0 based index without gaps
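
The shuffle-and-slice idea in the loop above can also be written without a Python loop. The sketch below is one possible vectorized form, assuming a NumPy array as the population; the name disjoint_ordered_samples is hypothetical. Like the loop above, it produces disjoint samples (no element is shared between two samples), which is a stronger condition than independent sampling, and it requires m * n to be at most the population length.

```python
import numpy as np

def disjoint_ordered_samples(population, m, n, rng=None):
    """Return m disjoint samples of size n, each in original population order."""
    rng = np.random.default_rng() if rng is None else rng
    population = np.asarray(population)
    if m * n > len(population):
        raise ValueError("m * n cannot exceed the population length")

    # Shuffle all indices once, then cut the first m * n of them into m rows.
    idx = rng.permutation(len(population))[: m * n].reshape(m, n)
    # Sorting each row restores the original order inside every sample.
    idx.sort(axis=1)
    # Fancy indexing on the first axis; extra columns of `population` are kept.
    return population[idx]
```

For a population of shape (10000, 1) with m = 20 and n = 200, the result has shape (20, 200, 1).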