Why is random.sample faster than numpy's random.choice?

Question

I need to sample without replacement from an array a. I tried two approaches (see the MCVE below): random.sample() and np.random.choice().

I assumed the numpy function would be faster, but it turns out it is not. In my tests random.sample is ~15% faster than np.random.choice.

Is this correct, or am I doing something wrong in my example below? If this is correct, why?

import numpy as np
import random
import time
from contextlib import contextmanager


@contextmanager
def timeblock(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        print('{} elapsed: {}'.format(label, end - start))


def f1(a, n_sample):
    return random.sample(range(len(a)), n_sample)


def f2(a, n_sample):
    return np.random.choice(len(a), n_sample, replace=False)


# Generate random array
a = np.random.uniform(1., 100., 10000)
# Number of samples' indexes to randomly take from a
n_sample = 100
# Number of times to repeat functions f1 and f2
N = 100000

with timeblock("random.sample"):
    for _ in range(N):
        f1(a, n_sample)

with timeblock("np.random.choice"):
    for _ in range(N):
        f2(a, n_sample)

Answer 1

pku*_*rov 8

TL;DR: Since numpy v1.17.0, it is recommended to use a numpy.random.default_rng() object instead of numpy.random. For choice:

import numpy as np

rng = np.random.default_rng()    # you can pass seed
rng.choice(...)    # interface is the same

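Besides the other changes to the random API introduced in v1.17, this new version of choice is much smarter and should be the fastest option in most cases. The old version remains unchanged for backward compatibility!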

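As mentioned in the comments, there was a long-standing issue in numpy: the np.random.choice implementation is inefficient for k << n compared to random.sample from the Python standard library.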

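The problem is that np.random.choice(arr, size=k, replace=False) is implemented as permutation(arr)[:k]. For a large array and a small k, computing a permutation of the whole array is a waste of time and memory. The standard library's random.sample works in a more straightforward way: it samples iteratively without replacement, keeping track either of what has already been sampled or of what remains to sample from.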
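The difference between the two strategies can be illustrated with a minimal pure-Python sketch. The function names sample_via_permutation and sample_via_tracking are hypothetical and exist only for this illustration; they are not the actual numpy or CPython implementations, which differ in detail.

```python
import random


def sample_via_permutation(n, k, rng=random):
    """Shuffle all n indices, then keep the first k: O(n) time and memory."""
    indices = list(range(n))
    rng.shuffle(indices)
    return indices[:k]


def sample_via_tracking(n, k, rng=random):
    """Draw indices one at a time and reject repeats: roughly O(k) work when k << n."""
    seen = set()
    result = []
    while len(result) < k:
        candidate = rng.randrange(n)
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


print(len(sample_via_permutation(10000, 100)))
print(len(sample_via_tracking(10000, 100)))
```

The first function always touches all n elements, however small k is. The second only does work proportional to k as long as collisions are rare, but it slows down as k approaches n, which is consistent with the timings further down.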

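In v1.17.0 numpy introduced a rework and improvement of the numpy.random package (docs, what's new, performance). I strongly suggest taking a look at least at the first link. Note that, as stated there, the old numpy.random API remains unchanged for backward compatibility: it keeps using the old implementations.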

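So the new recommended way to use the random API is a numpy.random.default_rng() object instead of numpy.random. Note that it is an object and that it accepts an optional seed argument, so you can pass it around in a convenient way. By default it also uses a different generator, which is faster on average (see the performance link above for details).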
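A short sketch of seeding and reusing a generator object is shown here. The seed value and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

seed = 12345  # arbitrary value chosen for illustration

rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

# Two generators built from the same seed produce the same stream.
sample_a = rng_a.choice(10000, size=100, replace=False)
sample_b = rng_b.choice(10000, size=100, replace=False)
print(np.array_equal(sample_a, sample_b))

# Reusing one generator continues its stream rather than restarting it.
sample_c = rng_a.choice(10000, size=100, replace=False)
print(len(np.unique(sample_c)))
```

Creating the generator once and passing it to the functions that need randomness keeps results reproducible and avoids paying the construction cost on every call.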

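As for your case, you probably want to use np.random.default_rng().choice(...) now. Besides being faster thanks to the improved random generator, choice itself became smarter. It now uses whole-array permutation only when the array is sufficiently large (> 10000 elements) and k is relatively large (> 1/50 of the size). Otherwise it uses Floyd's sampling algorithm (short description, numpy implementation).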
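For readers unfamiliar with Floyd's sampling algorithm, here is a minimal pure-Python sketch. The function name floyd_sample is hypothetical, and this is a textbook version of the algorithm rather than numpy's actual code.

```python
import random


def floyd_sample(n, k, rng=random):
    """Return a set of k distinct integers from range(n) using Floyd's algorithm."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)  # uniform on 0..j inclusive
        # If t was already picked, j itself cannot have been picked yet,
        # because every earlier draw was at most j - 1.
        chosen.add(j if t in chosen else t)
    return chosen


print(len(floyd_sample(10000, 100)))
```

The loop runs exactly k times and never retries, so the cost does not depend on n. The result is an unordered set; if a random order matters, shuffle it afterwards.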

Here is a performance comparison on my laptop:

100 samples from an array of 10000 elements x 10000 times:

random.sample elapsed: 0.8711776689742692
np.random.choice elapsed: 1.9704092079773545
np.random.default_rng().choice elapsed: 0.818919860990718

1000 samples from an array of 10000 elements x 10000 times:

random.sample elapsed: 8.785315042012371
np.random.choice elapsed: 1.9777243090211414
np.random.default_rng().choice elapsed: 1.05490942299366

10000 samples from an array of 10000 elements x 10000 times:

random.sample elapsed: 80.15063399000792
np.random.choice elapsed: 2.0218082449864596
np.random.default_rng().choice elapsed: 2.8596064270241186

The code I used:

import numpy as np
import random
from timeit import default_timer as timer
from contextlib import contextmanager


@contextmanager
def timeblock(label):
    start = timer()
    try:
        yield
    finally:
        end = timer()
        print('{} elapsed: {}'.format(label, end - start))


def f1(a, n_sample):
    return random.sample(range(len(a)), n_sample)


def f2(a, n_sample):
    return np.random.choice(len(a), n_sample, replace=False)


def f3(a, n_sample):
    return np.random.default_rng().choice(len(a), n_sample, replace=False)


# Generate random array
a = np.random.uniform(1., 100., 10000)
# Number of samples' indexes to randomly take from a
n_sample = 100
# Number of times to repeat tested functions
N = 100000

print(f'{N} times {n_sample} samples')
with timeblock("random.sample"):
    for _ in range(N):
        f1(a, n_sample)

with timeblock("np.random.choice"):
    for _ in range(N):
        f2(a, n_sample)

with timeblock("np.random.default_rng().choice"):
    for _ in range(N):
        f3(a, n_sample)

Archived:	8 years, 9 months ago
Views:	5334
Last activity:	4 years, 6 months ago