Tags: python, optimization, performance, numpy, cython
Consider a numpy array A of dimensions N x M. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way to do this? This isn't exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).
This is the fastest I've been able to come up with so far:
n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))
But is this the fastest way to do it? I'm mainly concerned about the for loop. Can we beat it with, say, Cython?
To avoid the loop I tried using broadcasting, doing something like this:
D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
But that turned out to be a bad idea: building the intermediate 3D array of dimensions N x N x M carries enough overhead that the performance is actually worse.
I also tried Cython. But I'm new to Cython, so I don't know how good my attempt is:
import numpy as np
cimport numpy as np

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n,n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A-A[i]).sum(1)).astype(np.float64)
    return dm
The code above is a bit slower than the plain Python for loop. I don't know much about Cython, but I'd assume I should at least be able to match the performance of the for loop + numpy version. I'm wondering whether a significant speed-up is achievable when this is done the right way? Or is there some other way to speed this up (not involving parallel computing)?
The key with Cython is to avoid Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.
Here's a quick stab at a more optimized Cython version of your distance function:
import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):

    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk

        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)

        double tmpss, diff

    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):

            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric

    return D
I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:
import pyximport
pyximport.install()
import fastdist
import numpy as np
A = np.random.randn(100, 200)
D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)
print np.allclose(D1, D2)
# True
So it works, at least. Let's do some benchmarking using the %timeit magic:
%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop
%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop
A ~9x speed-up is nice, but not really a game-changer. As you said, though, the biggest problem with the broadcasting approach is the memory requirement of building the intermediate array.
A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop
I wouldn't recommend trying that one using broadcasting...
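For a sense of scale (my own back-of-the-envelope arithmetic, not a measurement from the answer): for the 1000 x 2000 array A2 above, the intermediate (N, N, M) float64 array that broadcasting would allocate works out to roughly 16 GB before the reduction even starts:

N, M = 1000, 2000
print(N * N * M * 8 / 1e9)   # ~16.0 GB just for the broadcasted differences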
Another thing we could do is parallelize over the outermost loop, using the prange function:
from cython.parallel cimport prange
...
for ii in prange(nrow, nogil=True, schedule='guided'):
    ...
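The benchmark below calls a function named dist_parallel, which the answer doesn't spell out in full. Purely as a sketch of what it might look like (my reconstruction, not the answer's actual code), assuming the same body as dist but with D held in a typed memoryview so the loop body can run with the GIL released:

import numpy as np
cimport cython
from cython.parallel cimport prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_parallel(double[:, :] A):
    cdef:
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk
        double tmpss, diff
        # a typed memoryview (rather than an np.ndarray buffer) lets us index
        # and assign inside the nogil block
        double[:, :] D = np.zeros((nrow, nrow), dtype=np.float64)

    for ii in prange(nrow, nogil=True, schedule='guided'):
        for jj in range(ii + 1, nrow):
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                # written as "x = x + y" rather than "x += y" so Cython doesn't
                # infer tmpss as an OpenMP reduction variable
                tmpss = tmpss + diff * diff
            D[ii, jj] = sqrt(tmpss)
            D[jj, ii] = D[ii, jj]
    return np.asarray(D)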
In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do that using pyximport, but if you're using gcc you can compile it manually, something like this:
$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
-Wall -fno-strict-aliasing -I/usr/include/python2.7 -o fastdist.so fastdist.c
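If you'd rather not invoke cython and gcc by hand, a small setup.py should also do the job (a sketch, not part of the original answer; the -fopenmp flags assume gcc):

# setup.py -- build the OpenMP-enabled extension with
#   python setup.py build_ext --inplace
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "fastdist",
    sources=["fastdist.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-fopenmp", "-O3"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))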
With parallelism enabled, using 8 threads:
%timeit D2 = fastdist.dist_parallel(A2)
# 1 loops, best of 3: 509 ms per loop