Max*_*xim 17 python performance numpy inline cython
Consider the following code:
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t] arr, int i):
    arr[i] += 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(arr, i)

def test2(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        arr[i] += 1
I used IPython to measure the speed of test1 and test2:
In [7]: timeit ttt.test1(arr)
100 loops, best of 3: 6.13 ms per loop
In [8]: timeit ttt.test2(arr)
100000 loops, best of 3: 9.79 us per loop
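Is there a way to optimize test1? Why doesn't Cython inline this function as it was told to?
Update: what I actually need is multidimensional code like this: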
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i, j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i, j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i, j] += 1
Timings:
In [7]: timeit ttt.test1(arr)
1 loops, best of 3: 647 ms per loop
In [8]: timeit ttt.test2(arr)
100 loops, best of 3: 2.07 ms per loop
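Inlining by hand gives a 300x speedup. But my actual function is quite large, so inlining it manually makes the code much harder to maintain.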
UPDATE2:
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.float32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i, j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i, j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i, j] += 1

cdef class FastPassingFloat2DArray(object):
    cdef float* data
    cdef int stride0, stride1

    def __init__(self, np.ndarray[np.float32_t, ndim=2] arr):
        self.data = <float*>arr.data
        self.stride0 = arr.strides[0]/arr.dtype.itemsize
        self.stride1 = arr.strides[1]/arr.dtype.itemsize

    def __getitem__(self, tuple tp):
        cdef int i, j
        cdef float *pr, r
        i, j = tp
        pr = (self.data + self.stride0*i + self.stride1*j)
        r = pr[0]
        return r

    def __setitem__(self, tuple tp, float value):
        cdef int i, j
        cdef float *pr, r
        i, j = tp
        pr = (self.data + self.stride0*i + self.stride1*j)
        pr[0] = value

cdef inline inc2(FastPassingFloat2DArray arr, int i, int j):
    arr[i, j] += 1

def test3(np.ndarray[np.float32_t, ndim=2] arr):
    cdef int i, j
    cdef FastPassingFloat2DArray tmparr = FastPassingFloat2DArray(arr)
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc2(tmparr, i, j)
Timings:
In [4]: timeit ttt.test1(arr)
1 loops, best of 3: 623 ms per loop
In [5]: timeit ttt.test2(arr)
100 loops, best of 3: 2.29 ms per loop
In [6]: timeit ttt.test3(arr)
1 loops, best of 3: 201 ms per loop
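
The pointer arithmetic in FastPassingFloat2DArray (byte strides divided by the item size, then data + stride0*i + stride1*j) can be modeled in plain Python. This is only an illustrative sketch: the class name Strided2D, its offset() helper, and the list-backed storage are hypothetical stand-ins, not part of the question's code, and it says nothing about Cython's speed.

```python
class Strided2D:
    """Hypothetical pure-Python model of strided 2D indexing over flat storage."""

    def __init__(self, flat, shape, byte_strides, itemsize):
        self.flat = flat
        self.shape = shape
        # Convert byte strides to element strides, as the question's class does.
        # Integer division keeps the results usable as list indices.
        self.stride0 = byte_strides[0] // itemsize
        self.stride1 = byte_strides[1] // itemsize

    def offset(self, i, j):
        # Same formula as data + stride0*i + stride1*j in the question.
        return self.stride0 * i + self.stride1 * j

    def __getitem__(self, ij):
        i, j = ij
        return self.flat[self.offset(i, j)]

    def __setitem__(self, ij, value):
        i, j = ij
        self.flat[self.offset(i, j)] = value


# A 2x3 array of 4-byte items in C order has byte strides (12, 4).
a = Strided2D(list(range(6)), shape=(2, 3), byte_strides=(12, 4), itemsize=4)
a[1, 2] += 1  # reads the element at flat offset 5, then writes it back

# A transposed view is the same storage with swapped shape and strides.
t = Strided2D(a.flat, shape=(3, 2), byte_strides=(4, 12), itemsize=4)
```

Note that the sketch uses `//`, because under Python 3 semantics `/` yields a float; the question's code relies on the older integer behavior of `/`.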
Ali*_*Ali 18
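More than 3 years have passed since the question was posted, and a lot of progress has been made in the meantime. On this code (Update 2 of the question):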
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(np.ndarray[np.int32_t, ndim=2] arr, int i, int j):
    arr[i, j] += 1

def test1(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i, j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)

def test2(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i, j
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            arr[i, j] += 1
I get the following timings:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test1(arr)
%timeit test2(arr)
1 loops, best of 3: 354 ms per loop
1000 loops, best of 3: 1.02 ms per loop
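So the problem is still reproducible even after more than 3 years. Cython now has typed memoryviews, which AFAIK were introduced in Cython 0.16, so they were not available when the question was posted. With them: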
# cython: infer_types=True
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cdef inline inc(int[:, ::1] tmv, int i, int j):
    tmv[i, j] += 1

def test3(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i, j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            inc(tmv, i, j)

def test4(np.ndarray[np.int32_t, ndim=2] arr):
    cdef int i, j
    cdef int[:, ::1] tmv = arr
    for i in xrange(tmv.shape[0]):
        for j in xrange(tmv.shape[1]):
            tmv[i, j] += 1
With this I get:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 977 µs per loop
1000 loops, best of 3: 838 µs per loop
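
In the declaration int[:, ::1], the ::1 marks the last dimension as having unit stride, i.e. the buffer is expected to be C-contiguous. Python's built-in memoryview exposes the same kind of buffer metadata (shape, strides, contiguity), so the idea can be explored without compiling anything. The sketch below uses only the standard library and is not Cython's typed memoryview itself; the variable names are illustrative.

```python
from array import array

data = array("i", range(6))

# View the same buffer as a 2x3 array. The stdlib requires going through
# a byte view ("B") when casting a 1D view to a multi-dimensional one.
view2d = memoryview(data).cast("B").cast("i", shape=[2, 3])

print(view2d.shape)         # (2, 3)
print(view2d.strides)       # (3 * itemsize, itemsize): unit stride in the last dimension
print(view2d.c_contiguous)  # True

# Taking every second element of a 1D view gives a stride of two items,
# so the result is no longer contiguous.
strided = memoryview(data)[::2]
print(strided.c_contiguous)  # False
```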
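We are almost there, and it is already faster than the old-school way! Now, the inc() function is eligible to be declared nogil, so let's declare it so! But oops: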
Error compiling Cython file:
[...]
cdef inline inc(int[:, ::1] tmv, int i, int j) nogil:
^
[...]
Function with Python return type cannot be declared nogil
Ah, I completely missed that the void return type was missing! Once more, but now with void:
cdef inline void inc(int[:, ::1] tmv, int i, int j) nogil:
    tmv[i, j] += 1
Finally I get:
%timeit test3(arr)
%timeit test4(arr)
1000 loops, best of 3: 843 µs per loop
1000 loops, best of 3: 853 µs per loop
As fast as manual inlining!
Now, just for fun, I tried Numba on this code:
import numpy as np
from numba import autojit, jit

@autojit
def inc(arr, i, j):
    arr[i, j] += 1

@autojit
def test5(arr):
    for i in xrange(arr.shape[0]):
        for j in xrange(arr.shape[1]):
            inc(arr, i, j)
I get:
arr = np.zeros((1000,1000), dtype=np.int32)
%timeit test5(arr)
100 loops, best of 3: 4.03 ms per loop
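Even though it is 4.7x slower than Cython, most likely because the JIT compiler failed to inline inc(), I think it is awesome! All I needed to do was add @autojit, without messing up the code with clumsy type declarations; an 88x speedup for almost nothing!
I have tried other things with Numba, such as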
@jit('void(i4[:,:],i4,i4)')
def inc(arr, i, j):
    arr[i, j] += 1
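or nopython=True, but neither improved things any further.
Improving inlining is on the Numba developers' list; we just need to file more requests so that it gets a higher priority. ;)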
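You are passing the array to inc() as a Python object of type numpy.ndarray. Passing Python objects is expensive because of issues such as reference counting, and it also seems to prevent inlining. If you pass the array the C way, i.e. as a pointer, test1() becomes even faster than test2() on my machine: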
cimport numpy as np

cdef inline inc(int* arr, int i):
    arr[i] += 1

def test1(np.ndarray[np.int32_t] arr):
    cdef int i
    for i in xrange(len(arr)):
        inc(<int*>arr.data, i)
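The problem is that assigning a NumPy array (or, equivalently, passing it as a function argument) is not just a simple assignment but a "buffer extraction", which fills a struct and pulls the stride and pointer information out into the local variables needed for fast indexing. If you are iterating over a moderate number of elements, this O(1) overhead is easily amortized over the loop, but that is certainly not the case for small functions.
Improving this is high on many people's wishlist, but it is a non-trivial change. See, for example, the discussion at http://groups.google.com/group/cython-users/browse_thread/thread/8fc8686315d7f3fe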