And*_*cki 3 python performance loops for-loop numpy
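I have a slow loop that I would like to speed up by pushing it into NumPy. I have spent days playing with this code without getting anywhere. Is it even possible, or is there a NumPy trick I am missing? Is there some refactoring I could do that would help?
As you can see, I want to sum the mixins into the accumulator, each one offset by the corresponding entry of xs.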
import numpy as np
blocksize = 1000 # Chosen at runtime.
mixinsize = 100 # Chosen at runtime.
count = 10000 # Chosen at runtime.
xs = np.random.randint(0, blocksize + 1, count) # In practice this is data.
mixins = np.empty((count, mixinsize)) # In practice this is data.
# The slow part:
accumulator = np.zeros(blocksize + mixinsize)
for i in xrange(count):
    accumulator[xs[i]:xs[i] + mixinsize] += mixins[i]
Get Numba 0.11 (not 0.12) from numba.pydata.org. Now we can JIT-compile this code with LLVM:
# plain NumPy version
import numpy as np
def foobar(mixinsize, count, xs, mixins, acc):
    for i in xrange(count):
        k = xs[i]
        acc[k:k + mixinsize] += mixins[i,:]
# LLVM compiled version
from numba import jit, void, int64, double
signature = void(int64,int64,int64[:],double[:,:],double[:])
foobar_jit = jit(signature)(foobar)
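The explicit-signature call above is specific to the old Numba release named here. As a rough sketch only, assuming a recent Numba where the `njit` decorator infers types on the first call, the same loop might be written as follows. The name `foobar_njit` and the fallback decorator are made up for this illustration, and the inner loop is spelled out instead of using slice assignment.

```python
import numpy as np

try:
    from numba import njit  # assumption: a recent Numba release is installed
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs (slowly) without Numba.
        return func


@njit
def foobar_njit(xs, mixins, acc):
    # Add row i of mixins into acc, starting at offset xs[i].
    count, mixinsize = mixins.shape
    for i in range(count):
        k = xs[i]
        for j in range(mixinsize):
            acc[k + j] += mixins[i, j]


# Small consistency check against the slice-based NumPy loop.
rng = np.random.RandomState(0)
blocksize, mixinsize, count = 50, 5, 200
xs = rng.randint(0, blocksize + 1, count)
mixins = rng.rand(count, mixinsize)

expected = np.zeros(blocksize + mixinsize)
for i in range(count):
    expected[xs[i]:xs[i] + mixinsize] += mixins[i]

acc = np.zeros(blocksize + mixinsize)
foobar_njit(xs, mixins, acc)
```

With lazy compilation the first call includes compile time, so it would need to be excluded from any timing comparison.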
if __name__ == "__main__":
    from time import clock
    blocksize = 1000 # Chosen at runtime.
    mixinsize = 100 # Chosen at runtime.
    count = 100000 # Chosen at runtime.
    xs = np.random.randint(0, blocksize + 1, count)
    mixins = np.empty((count, mixinsize))
    acc = np.zeros(blocksize + mixinsize)
    t0 = clock()
    foobar(mixinsize, count, xs, mixins, acc)
    t1 = clock()
    print("elapsed time: %g ms" % (1000*(t1-t0),))
    t2 = clock()
    foobar_jit(mixinsize, count, xs, mixins, acc)
    t3 = clock()
    print("elapsed time with numba jit: %g ms" % (1000*(t3-t2),))
    print("speedup factor: %g" % ((t1-t0)/(t3-t2),))
$ python test_numba.py
elapsed time: 590.632 ms
elapsed time with numba jit: 12.31 ms
speedup factor: 47.9799
OK, so that is almost a 50x speedup from adding just three lines of Python code.
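The timings here rely on `time.clock`, which was removed in Python 3.8. On a newer interpreter, a similar measurement could be sketched with `time.perf_counter`. The helper name `time_ms` is hypothetical and not part of the original script, and `sum` merely stands in for the function being timed.

```python
from time import perf_counter


def time_ms(func, *args, **kwargs):
    # Hypothetical helper: run func once and return the wall-clock
    # time it took, in milliseconds.
    start = perf_counter()
    func(*args, **kwargs)
    return 1000.0 * (perf_counter() - start)


# Stand-in workload; in the script above this would be a call such as
# time_ms(foobar, mixinsize, count, xs, mixins, acc).
elapsed = time_ms(sum, range(100000))
print("elapsed time: %g ms" % elapsed)
```

A single run is noisy, so repeating the call and taking the minimum would give a steadier figure.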
Now we can also test a plain C version for comparison, using clang/LLVM as the compiler.
void foobar(long mixinsize, long count,
            long *xs, double *mixins, double *accumulator)
{
    long i, j, k;
    double *cur, *acc;
    for (i=0;i<count;i++) {
        acc = accumulator + xs[i];
        cur = mixins + i*mixinsize;
        for(j=0;j<mixinsize;j++) *acc++ += *cur++;
    }
}
from numpy.ctypeslib import ndpointer
import ctypes
so = ctypes.CDLL('plainc.so')
foobar_c = so.foobar
foobar_c.restype = None
foobar_c.argtypes = (
    ctypes.c_long,
    ctypes.c_long,
    ndpointer(dtype=np.int64, ndim=1),
    ndpointer(dtype=np.float64, ndim=2),
    ndpointer(dtype=np.float64, ndim=1)
)
t4 = clock()
foobar_c(mixinsize, count, xs, mixins, acc)
t5 = clock()
print("elapsed time with plain C: %g ms" % (1000*(t5-t4),))
$ CC -Ofast -shared -m64 -o plainc.so plainc.c
$ python test_numba.py
elapsed time: 599.136 ms
elapsed time with numba jit: 11.958 ms
speedup factor: 50.1034
elapsed time with plain C: 5.472 ms
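So Numba runs at about half the speed of the plain C version when the C code is optimized with -Ofast. For comparison, the run time with -O2 was about 8 ms. That means that in this case Numba-JIT-compiled Python runs at about 75% of the speed of C built with the -O2 optimization flag. Not bad for just three additional lines of Python code.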
We can also compare against a plain Python version:
def foobar_py(mixinsize, count, xs, mixins, acc):
    for i in xrange(count):
        k = xs[i]
        for j in xrange(mixinsize):
            acc[j+k] += mixins[i][j]
# convert NumPy arrays to lists
_xs = map(int,xs)
_mixins = [map(float,mixins[i,:]) for i in xrange(count)]
_acc = map(float,acc)
t6 = clock()
foobar_py(mixinsize, count, _xs, _mixins, _acc)
t7 = clock()
print("elapsed time with plain Python: %g ms" % (1000*(t7-t6),))
This Python code executed in 1775 ms. So relative to plain Python, we get about a 3x speedup with NumPy, a 150x speedup with Numba, and a 350x speedup with C and -Ofast.
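On the original question of whether the loop can stay inside NumPy: one possible approach, offered only as a sketch and not timed here, is to compute the target index of every mixin element and let `np.bincount` do the scatter-add. The function name `accumulate_bincount` is made up for this example, and the index array costs extra memory proportional to `count * mixinsize`.

```python
import numpy as np


def accumulate_bincount(xs, mixins, blocksize):
    # Hypothetical helper: loop-free version of the accumulation.
    count, mixinsize = mixins.shape
    # idx[i, j] is the accumulator slot that mixins[i, j] is added to.
    idx = xs[:, None] + np.arange(mixinsize)
    # bincount sums the weights that share the same index.
    return np.bincount(idx.ravel(), weights=mixins.ravel(),
                       minlength=blocksize + mixinsize)


# Consistency check against the original loop on random data.
rng = np.random.RandomState(0)
blocksize, mixinsize, count = 1000, 100, 10000
xs = rng.randint(0, blocksize + 1, count)
mixins = rng.rand(count, mixinsize)

expected = np.zeros(blocksize + mixinsize)
for i in range(count):
    expected[xs[i]:xs[i] + mixinsize] += mixins[i]

result = accumulate_bincount(xs, mixins, blocksize)
```

Whether this beats the sliced loop depends on the sizes involved, so it would need to be measured alongside the other variants.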
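Heed the warning of Donald Knuth, who attributed it to C.A.R. Hoare: "Premature optimization is the root of all evil in computer programming." While these may look like impressive relative speedups, the absolute gain along this route saves us only milliseconds of CPU time. Is it really worth my time to save that much CPU with that amount of labour? Is it worth yours? Decide for yourself.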