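I'm trying to do random sampling in Python as efficiently as possible, but I'm puzzled because numpy's np.random.choice() is much slower than the standard library's random.choices().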

import numpy as np
import random

np.random.seed(12345)

# use gamma distribution
shape, scale = 2.0, 2.0
s = np.random.gamma(shape, scale, 1000000)
meansample = []

samplesize = 500

%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
# 23.3 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
# 152 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

23 s versus 152 ms is a huge difference.

What am I doing wrong?
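
There are two issues here. First, for the pure-Python random library, you probably meant to use sample rather than choices, since choices samples with replacement and sample samples without replacement. That changes the benchmark somewhat. Second, np.random.choice has a better-performing alternative for sampling without replacement. This is a known issue related to the random Generator API. You can use np.random.Generator to get better performance. My timings: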
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
# 1 loop, best of 3: 12.4 s per loop
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
# 10 loops, best of 3: 118 ms per loop
sl = s.tolist()
%timeit meansample = [np.mean(random.sample(sl, k=samplesize)) for x in range(0,500)]
# 1 loop, best of 3: 219 ms per loop
g = np.random.Generator(np.random.PCG64())
%timeit meansample = [ np.mean( g.choice( s, samplesize, replace=False)) for _ in range(500)]
# 10 loops, best of 3: 25 ms per loop
So, for sampling without replacement, random.sample outperforms np.random.choice but is slower than np.random.Generator.choice.
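
For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module instead of %timeit. The population size (100,000 rather than 1,000,000), the repeat count, and the helper function names are my own assumptions chosen to keep the run short, so the absolute numbers will not match the timings above. My understanding is that the legacy np.random.choice shuffles the whole population when replace=False, which would explain why it scales so badly, but treat that as an assumption rather than a verified fact.

```python
import random
import timeit

import numpy as np

# Assumption: a smaller population than in the question, so the sketch runs quickly.
rng = np.random.default_rng(12345)  # returns a np.random.Generator (PCG64 by default)
population = rng.gamma(2.0, 2.0, 100_000)
population_list = population.tolist()  # mirrors the answer's `sl = s.tolist()`
sample_size = 500
repeats = 20  # hypothetical repeat count, not taken from the original post


def stdlib_with_replacement():
    # random.choices samples WITH replacement
    return random.choices(population_list, k=sample_size)


def stdlib_without_replacement():
    # random.sample samples WITHOUT replacement
    return random.sample(population_list, k=sample_size)


def legacy_without_replacement():
    # legacy NumPy API used in the question
    return np.random.choice(population, sample_size, replace=False)


def generator_without_replacement():
    # newer Generator API suggested in the answer
    return rng.choice(population, sample_size, replace=False)


candidates = {
    "random.choices (with replacement)": stdlib_with_replacement,
    "random.sample": stdlib_without_replacement,
    "np.random.choice(replace=False)": legacy_without_replacement,
    "Generator.choice(replace=False)": generator_without_replacement,
}

# Total seconds for `repeats` calls of each candidate.
timings = {name: timeit.timeit(fn, number=repeats) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds / repeats * 1e3:.3f} ms per call")
```

Only the three without-replacement variants are a like-for-like comparison; random.choices is included just to show what the question originally measured.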