Why is numpy's einsum slower than numpy's built-in functions?

bog*_*ron 11 python performance numpy linear-algebra

I usually get very good performance out of numpy's einsum function (and I like its syntax). @Ophion's answer to this question shows that, for the cases tested, einsum consistently outperforms the "built-in" functions (sometimes by a little, sometimes by a lot). But I just ran into a case where einsum is much slower. Consider the following equivalent functions:

(M, K) = (1000000, 20)
C = np.random.rand(K, K)
X = np.random.rand(M, K)

def func_dot(C, X):
    Y = X.dot(C)
    return np.sum(Y * X, axis=1)

def func_einsum(C, X):
    return np.einsum('ik,km,im->i', X, C, X)

def func_einsum2(C, X):
    # Like func_einsum but break it into two steps.
    A = np.einsum('ik,km', X, C)
    return np.einsum('ik,ik->i', A, X)

I expected func_einsum to run fastest, but that's not what I find. Running on a quad-core CPU with hyperthreading, numpy version 1.9.0.dev-7ae0206, and multithreading with OpenBLAS, I get the following results:

In [2]: %time y1 = func_dot(C, X)
CPU times: user 320 ms, sys: 312 ms, total: 632 ms
Wall time: 209 ms
In [3]: %time y2 = func_einsum(C, X)
CPU times: user 844 ms, sys: 0 ns, total: 844 ms
Wall time: 842 ms
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 292 ms, sys: 44 ms, total: 336 ms
Wall time: 334 ms

When I increase K to 200, the differences are more extreme:

In [2]: %time y1= func_dot(C, X)
CPU times: user 4.5 s, sys: 1.02 s, total: 5.52 s
Wall time: 2.3 s
In [3]: %time y2= func_einsum(C, X)
CPU times: user 1min 16s, sys: 44 ms, total: 1min 16s
Wall time: 1min 16s
In [4]: %time y3 = func_einsum2(C, X)
CPU times: user 15.3 s, sys: 312 ms, total: 15.6 s
Wall time: 15.6 s

Can someone explain why einsum is so much slower here?

If it matters, here is my numpy configuration:

In [6]: np.show_config()
lapack_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
atlas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = c
    include_dirs = ['/usr/local/include']
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
atlas_blas_threads_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_INFO', '"\\"None\\""')]
    language = c
    include_dirs = ['/usr/local/include']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
    language = f77
    include_dirs = ['/usr/local/include']
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE

Jai*_*ime 17

You can have the best of both worlds:

def func_dot_einsum(C, X):
    Y = X.dot(C)
    return np.einsum('ij,ij->i', Y, X)

On my system:

In [7]: %timeit func_dot(C, X)
10 loops, best of 3: 31.1 ms per loop

In [8]: %timeit func_einsum(C, X)
10 loops, best of 3: 105 ms per loop

In [9]: %timeit func_einsum2(C, X)
10 loops, best of 3: 43.5 ms per loop

In [10]: %timeit func_dot_einsum(C, X)
10 loops, best of 3: 21 ms per loop

np.dot uses BLAS, MKL, or whatever library you have available, so the call to np.dot is almost certainly multithreaded. np.einsum has its own loops, so it doesn't use any of those optimizations, apart from its own use of SIMD to speed things up over a vanilla C implementation.


Then there's the multi-operand einsum call that runs much slower... The numpy source for einsum is very complex and I don't fully understand it. So be advised that what follows is speculative at best, but here is what I think is going on...

When you run something like np.einsum('ij,ij->i', a, b), the benefit over np.sum(a*b, axis=1) comes from avoiding having to instantiate the intermediate array with all the products and looping twice over it. So at a low level what goes on is something like:

for i in range(I):
    out[i] = 0
    for j in range(J):
        out[i] += a[i, j] * b[i, j]
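That equivalence is easy to verify directly (a quick sketch, not from the original answer; shapes are arbitrary):

```python
import numpy as np

a = np.random.rand(50, 30)
b = np.random.rand(50, 30)

# einsum fuses the product and the row sum in one pass.
y_einsum = np.einsum('ij,ij->i', a, b)
# The naive version materializes the full a * b intermediate first.
y_naive = np.sum(a * b, axis=1)

assert np.allclose(y_einsum, y_naive)
```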

Now say that you are after something like:

np.einsum('ij,jk,ik->i', a, b, c)

You could do the same operation with

np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))

And what I think einsum does is run this last code without having to instantiate the huge intermediate array, which certainly makes things faster:

In [29]: a, b, c = np.random.rand(3, 100, 100)

In [30]: %timeit np.einsum('ij,jk,ik->i', a, b, c)
100 loops, best of 3: 2.41 ms per loop

In [31]: %timeit np.sum(a[:, :, None] * b[None, :, :] * c[:, None, :], axis=(1, 2))
100 loops, best of 3: 12.3 ms per loop

But if you look at it carefully, getting rid of intermediate storage can be a terrible thing. This is what I think einsum is doing at the low level:

for i in range(I):
    out[i] = 0
    for j in range(J):
        for k in range(K):
            out[i] += a[i, j] * b[j, k] * c[i, k]

But you are repeating a ton of operations! If you instead did:

for i in range(I):
    out[i] = 0
    for j in range(J):
        temp = 0
        for k in range(K):
            temp += b[j, k] * c[i, k]
        out[i] += a[i, j] * temp

you would be doing I * J * (K-1) fewer multiplications (at the cost of I * J extra additions), saving a ton of time. My guess is that einsum is not smart enough to optimize things at this level. There is a hint in the source code that it only optimizes operations with 1 or 2 operands, not 3. In any case, automating this for general inputs seems anything but simple...
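For this particular contraction the factored loop can also be written directly in NumPy, since the inner temp accumulation is just a matrix product (a sketch in the answer's notation, with arbitrary shapes):

```python
import numpy as np

a, b, c = np.random.rand(3, 100, 100)

# Direct three-operand contraction, as einsum runs it (unfactored).
y_direct = np.einsum('ij,jk,ik->i', a, b, c)

# Factored: temp[i, j] = sum_k b[j, k] * c[i, k] is just (c @ b.T)[i, j],
# leaving only an element-wise product with a and a row sum.
y_factored = np.sum(a * (c @ b.T), axis=1)

assert np.allclose(y_direct, y_factored)
```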

  • Update: in numpy >= 1.12.0 there is a named argument "optimize" that tells numpy to perform this optimization. It is not the default because of memory concerns (and probably backwards compatibility). (3 upvotes)
  • Adding "optimize=True" to the example above, "%timeit np.einsum('ij,jk,ik->i', a, b, c)" drops the time from **2.4 ms** to **410 µs**! (3 upvotes)
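Following those comments, the keyword is straightforward to use (a minimal sketch, assuming numpy >= 1.12.0):

```python
import numpy as np

a, b, c = np.random.rand(3, 100, 100)

# Default path: one three-operand contraction, I*J*K multiplies per pass.
y_plain = np.einsum('ij,jk,ik->i', a, b, c)

# optimize=True lets numpy factor the contraction into pairwise steps,
# trading intermediate-array memory for far fewer multiplications.
y_opt = np.einsum('ij,jk,ik->i', a, b, c, optimize=True)

assert np.allclose(y_plain, y_opt)
```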

hpa*_*ulj 6

einsum has a special case for '2 operands, ndim=2'. Here there are 3 operands and 3 dimensions in total, so it has to use a general nditer.

While trying to understand how the subscript strings are parsed, I wrote a pure-Python einsum emulator: https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py

The (stripped-down) einsum and sum-of-products functions are:

def myeinsum(subscripts, *ops, **kwargs):
    # drop-in replacement for np.einsum (more or less)
    <parse subscript strings>
    <prepare op_axes>
    x = sum_of_prod(ops, op_axes, **kwargs)
    return x

def sum_of_prod(ops, op_axes,...):
    ...
    it = np.nditer(ops, flags, op_flags, op_axes)
    it.operands[nop][...] = 0
    it.reset()
    for (x,y,z,w) in it:
        w[...] += x*y*z
    return it.operands[nop]
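A runnable version of that sum-of-products step, hard-coded for three operands and using the op_axes shown in the debug output, might look like this (my own sketch following the docs' nditer reduction pattern, not the emulator's actual code):

```python
import numpy as np

def sum_of_prod3(ops, op_axes):
    # ops: three input arrays; op_axes maps each operand's axes (and the
    # allocated output's) onto the shared iteration axes; -1 = broadcast.
    it = np.nditer(ops + [None], flags=['reduce_ok'],
                   op_flags=[['readonly']] * 3 + [['readwrite', 'allocate']],
                   op_axes=op_axes)
    with it:
        it.operands[3][...] = 0
        for x, y, z, w in it:
            w[...] += x * y * z
        return it.operands[3]

M, K = 10, 5
X = np.random.rand(M, K)
C = np.random.rand(K, K)
# Iteration axes are (i, k, m), matching 'ik,km,im->i'.
op_axes = [[0, 1, -1], [-1, 0, 1], [0, -1, 1], [0, -1, -1]]
y = sum_of_prod3([X, C, X], op_axes)
assert np.allclose(y, np.einsum('ik,km,im->i', X, C, X))
```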

Debug output from myeinsum('ik,km,im->i', X, C, X, debug=True) with (M,K) = (10,5):

{'max_label': 109, 
 'min_label': 105, 
 'nop': 3, 
 'shapes': [(10, 5), (5, 5), (10, 5)], 
 ....}}
 ...
iter labels: [105, 107, 109],'ikm'

op_axes [[0, 1, -1], [-1, 0, 1], [0, -1, 1], [0, -1, -1]]

If you write a sum-of-products function like this in cython, you should get something close to a generalized einsum.

With the full (M,K), this emulated einsum is 6-7x slower than numpy's.


Some timings based on the other answers:

In [84]: timeit np.dot(X,C)
1 loops, best of 3: 781 ms per loop

In [85]: timeit np.einsum('ik,km->im',X,C)
1 loops, best of 3: 1.28 s per loop

In [86]: timeit np.einsum('im,im->i',A,X)
10 loops, best of 3: 163 ms per loop

The 'im,im->i' step is substantially faster than the others. The sum dimension, m, is only 20, so I suspect einsum treats this as a special case.

In [87]: timeit np.einsum('im,im->i',np.dot(X,C),X)
1 loops, best of 3: 950 ms per loop

In [88]: timeit np.einsum('im,im->i',np.einsum('ik,km->im',X,C),X)
1 loops, best of 3: 1.45 s per loop

The times for these composite calls are simply the sums of their corresponding pieces.