Speeding up numpy.dot

NPE*_*NPE 11 python performance numpy dot-product

I have a numpy script that spends about 50% of its runtime in the following code:

s = numpy.dot(v1, v1)

where

v1 = v[1:]

v is a 4000-element 1D ndarray of float64 stored in contiguous memory (v.strides is (8,)).

Any suggestions for speeding this up?

Edit: This is on Intel hardware. Here is the output of my numpy.show_config():

atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']

blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']

atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']

lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']

lapack_mkl_info:
  NOT AVAILABLE

blas_mkl_info:
  NOT AVAILABLE

mkl_info:
  NOT AVAILABLE

dou*_*oug 5

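Perhaps the culprit is the copying of the arrays passed to dot.

As Sven said, dot relies on BLAS operations. These operations require arrays stored in contiguous C order. If both arrays passed to dot are C_CONTIGUOUS, you should see better performance.

Of course, if the two arrays passed to dot really are 1D with shape (8,), then you should see both the C_CONTIGUOUS and F_CONTIGUOUS flags set to True; but if they have shape (1, 8), then you can see mixed order.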

>>> import numpy as NP
>>> w = NP.random.randint(0, 10, 100).reshape(100, 1)
>>> w.flags
   C_CONTIGUOUS : True
   F_CONTIGUOUS : False
   OWNDATA : False
   WRITEABLE : True
   ALIGNED : True
   UPDATEIFCOPY : False
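To check whether a copy is likely in the asker's case, one can inspect the slice directly. This is a minimal sketch, assuming `v` is a plain 4000-element float64 array as described in the question; the array contents and the names `v2` and `v2c` are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the question's data: 4000 float64 values.
v = np.arange(4000, dtype=np.float64)
v1 = v[1:]  # a view on v, not a copy

# A unit-step slice of a contiguous 1D array is itself contiguous.
print(v1.strides)                # (8,)
print(v1.flags['C_CONTIGUOUS'])  # True
print(v1.flags['F_CONTIGUOUS'])  # True

# A strided view is not contiguous; np.ascontiguousarray returns a packed copy.
v2 = v[::2]
v2c = np.ascontiguousarray(v2)
print(v2.flags['C_CONTIGUOUS'], v2c.flags['C_CONTIGUOUS'])  # False True
```

If both flags are True for `v1`, copying is probably not what makes this particular dot call slow.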


An alternative: use _GEMM from BLAS, which is exposed through the module scipy.linalg.fblas. (The two arrays A and B are obviously in Fortran order, since fblas is used.)

from scipy.linalg import fblas as FB
X = FB.dgemm(alpha=1., a=A, b=B, trans_b=True)
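For the 1D case in the question, the level-1 BLAS routine `ddot` is a closer match than `dgemm`. The sketch below assumes a SciPy release that exposes the BLAS wrappers as `scipy.linalg.blas` (newer releases no longer ship `scipy.linalg.fblas`); the test data is hypothetical, and whether this beats `numpy.dot` depends on the BLAS each library is linked against.

```python
import numpy as np
from scipy.linalg import blas  # assumption: SciPy with scipy.linalg.blas available

# Hypothetical stand-in for the question's data.
v = np.arange(4000, dtype=np.float64)
v1 = v[1:]

# ddot computes the double-precision dot product of two vectors.
s_blas = blas.ddot(v1, v1)
s_numpy = np.dot(v1, v1)

print(s_blas, s_numpy)
```

Both calls should agree; only a timing run on the target machine can tell which one is faster.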


mat*_*att 5

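Your arrays are not very big, so ATLAS may not be doing much. What timings do you get for the following Fortran program? Assuming ATLAS is not doing much, this should give you an idea of how fast dot() could be without any Python overhead. With gfortran -O3, I get 5 +/- 0.5 us per dot product.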

    program test

    real*8 :: x(4000), start, finish, s
    integer :: i, j
    integer,parameter :: jmax = 100000

    x(:) = 4.65
    s = 0.
    call cpu_time(start)
    do j=1,jmax
        s = s + dot_product(x, x)
    enddo
    call cpu_time(finish)
    print *, (finish-start)/jmax * 1.e6, s

    end program test
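To get a Python-side number in the same units, the call can be timed with `timeit`. This is a rough sketch rather than a careful benchmark: the repeat count `n_calls` is an arbitrary choice, and the array is filled with the same value as in the Fortran test.

```python
import timeit
import numpy as np

v = np.full(4001, 4.65)  # same fill value as the Fortran test
v1 = v[1:]               # 4000 elements, as in the question

n_calls = 10000  # arbitrary repeat count chosen for this sketch
total = timeit.timeit(lambda: np.dot(v1, v1), number=n_calls)

# Average time per call in microseconds, comparable to the Fortran output.
per_call_us = total / n_calls * 1e6
print(f"np.dot: {per_call_us:.2f} us per call")
```

The gap between this figure and the Fortran result gives a rough estimate of the per-call overhead on the Python side.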