Bri*_*ack 4 python parallel-processing numexpr numba
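I have been trying to optimize a piece of Python code that involves computations on large multi-dimensional arrays, and I am getting counterintuitive results with Numba. I am running on a mid-2015 MBP, 2.5 GHz quad-core i7, OS X 10.10.5, Python 2.7.11. Consider the following: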
import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit

# Pure-Python baseline: element-wise sum with explicit loops.
def add_two_2ds_naive(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

# Same loops, compiled by Numba.
@jit
def add_two_2ds_jit(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

# Generalized ufunc whose core dimensions cover the whole 2D array (single-threaded target).
@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
    '(n,m),(n,m)->(n,m)',target='cpu')
def add_two_2ds_cpu(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

# Same gufunc, compiled for the multi-threaded target.
@guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
    '(n,m),(n,m)->(n,m)',target='parallel')
def add_two_2ds_parallel(A,B,res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

def add_two_2ds_numexpr(A,B,res):
    # Note: this rebinds the local name res, so the caller's array is not updated
    # and a new array is allocated on every call.
    # ne.evaluate('A+B', out=res) would write into the existing array instead.
    res = ne.evaluate('A+B')

if __name__=="__main__":
    np.random.seed(69)
    A = np.random.rand(10000,100)
    B = np.random.rand(10000,100)
    res = np.zeros((10000,100))
I can now run timeit on the various functions:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop
It seems that 'parallel' is not even using the majority of a single core: top shows Python at about 40% CPU for 'parallel', about 100% for 'cpu', and numexpr reaching ~300%.
小智 5
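There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of a ufunc/gufunc. Since the signature of your gufunc is 2D, and your inputs are 2D, there is only a single call to the inner function, which explains the CPU usage of only 100% that you saw.
The best way to write the function you have above is to use a regular ufunc: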
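To make the "single call to the inner function" point concrete, here is a minimal sketch that counts kernel invocations. It uses NumPy's np.vectorize with a signature as a plain-Python stand-in, on the assumption that it follows the same core-dimension versus loop-dimension rules as a gufunc. It is not Numba and does not run in parallel, and the names add_2d, add_1d and calls are made up for illustration.

```python
import numpy as np

calls = {"2d": 0, "1d": 0}

def add_2d(a, b):
    # Kernel that consumes whole 2D blocks, like the gufuncs in the question.
    calls["2d"] += 1
    return a + b

def add_1d(a, b):
    # Kernel that consumes one row at a time.
    calls["1d"] += 1
    return a + b

# Core dimensions (n,m): a 2D input leaves no loop dimension to iterate over.
whole = np.vectorize(add_2d, signature="(n,m),(n,m)->(n,m)")
# Core dimension (m): the leading axis of a 2D input becomes the loop dimension.
by_row = np.vectorize(add_1d, signature="(m),(m)->(m)")

A = np.random.rand(8, 4)
B = np.random.rand(8, 4)

out_whole = whole(A, B)
out_rows = by_row(A, B)

print(calls)
```

With the 2D signature the kernel should be invoked once for the whole array, while the 1D signature should invoke it once per row. Only those repeated invocations give a parallel target something to distribute across threads.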
@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    return a + b
Then on my system, I see these speeds:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.43 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop
%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 2.03 ms per loop
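(This is a very similar OS X system to yours, but running OS X 10.11.)
Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it does not beat the simple single-threaded CPU case. I suspect the bottleneck is memory (or cache) bandwidth, but I have not done the measurements to check this.
Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (for example, a cosine).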
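As a rough, back-of-the-envelope check on the memory-bandwidth suspicion, the timings quoted in this thread can be turned into an effective data rate. The sketch assumes each element of A and B is read once and each element of res is written once, and it ignores cache effects and any extra write-allocate traffic, so it is an estimate rather than a measurement.

```python
# Array shape and dtype taken from the question: 10000 x 100 float64.
n_elements = 10000 * 100
bytes_per_element = 8   # float64
arrays_touched = 3      # read A, read B, write res

bytes_moved = n_elements * bytes_per_element * arrays_touched

def effective_bandwidth_gb_s(seconds):
    """Bytes moved per second, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9

# Timings (in ms) copied from the %timeit output in the question and the answer.
timings_ms = {
    "jit (question)": 1.16,
    "jit (answer)": 1.87,
    "parallel ufunc (answer)": 2.03,
}
for label, ms in timings_ms.items():
    print(label, round(effective_bandwidth_gb_s(ms * 1e-3), 1), "GB/s")
```

Each call moves about 24 MB for a single addition per element, so the loop does very little arithmetic relative to the data it streams. Extra threads cannot help much once the memory system is the limiting factor.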