python arrays performance numpy linear-algebra
Since my np.dot is accelerated by OpenBLAS and OpenMPI, I am wondering whether it is possible to write the double sum
for i in range(N):
    for j in range(N):
        B[k, l] += A[i, j, k, l] * X[i, j]
as an inner product. At the moment I am using
B = np.einsum("ijkl,ij->kl", A, X)
but unfortunately it is quite slow and only uses one processor. Any ideas?
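For reference, here is a complete, runnable version of the naive loop above, with the outer loops over k and l (implied in the snippet) spelled out; the sizes N and M are made up for the example:

import numpy as np

N, M = 4, 5  # illustrative sizes
A = np.random.rand(N, N, M, M)
X = np.random.rand(N, N)

B = np.zeros((M, M))
for k in range(M):
    for l in range(M):
        for i in range(N):
            for j in range(N):
                B[k, l] += A[i, j, k, l] * X[i, j]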
Edit: I benchmarked the proposed answers with a simple example, and they all seem to be in the same order of magnitude:
import numpy as np

A = np.random.random([200, 200, 100, 100])
X = np.random.random([200, 200])

def B1():
    return np.einsum("ijkl,ij->kl", A, X)

def B2():
    return np.tensordot(A, X, [[0, 1], [0, 1]])

def B3():
    # flatten the (i, j) axes and contract via a single BLAS matrix-vector product
    shp = A.shape
    return np.dot(X.ravel(), A.reshape(shp[0] * shp[1], -1)).reshape(shp[2], shp[3])

%timeit B1()
%timeit B2()
%timeit B3()
1 loops, best of 3: 300 ms per loop
10 loops, best of 3: 149 ms per loop
10 loops, best of 3: 150 ms per loop
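As a sanity check, the three variants agree numerically; a quick check with np.allclose on the same A and X:

assert np.allclose(B1(), B2())
assert np.allclose(B1(), B3())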
Concluding from these results, I would pick np.einsum, since its syntax is still the most readable and the other two are only about a factor of 2 faster. I guess the next step would be to externalize the code to C or Fortran.
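(Not from the original discussion, but one lighter-weight alternative to hand-written C or Fortran is to JIT-compile the explicit loops; a minimal sketch with numba, where the function name contract and the parallel settings are assumptions:)

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def contract(A, X):
    # hypothetical helper: same contraction as the loops above,
    # with the outer k loop distributed across threads
    ni, nj, nk, nl = A.shape
    B = np.zeros((nk, nl))
    for k in prange(nk):
        for l in range(nl):
            s = 0.0
            for i in range(ni):
                for j in range(nj):
                    s += A[i, j, k, l] * X[i, j]
            B[k, l] = s
    return B

If this works as intended, contract(A, X) should match np.einsum("ijkl,ij->kl", A, X).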
You can use np.tensordot():
np.tensordot(A, X, [[0, 1], [0, 1]])
which does use multiple cores.
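For example, with illustrative non-square sizes the axis mapping becomes visible:

import numpy as np

A = np.random.rand(3, 4, 5, 6)
X = np.random.rand(3, 4)

# axes 0 and 1 of A are contracted against axes 0 and 1 of X
B = np.tensordot(A, X, [[0, 1], [0, 1]])
print(B.shape)  # (5, 6)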
Edit: It is interesting to see how np.einsum and np.tensordot scale when the size of the input arrays is increased:
In [18]: for n in range(1, 31):
   ....:     A = np.random.rand(n, n+1, n+2, n+3)
   ....:     X = np.random.rand(n, n+1)
   ....:     print(n)
   ....:     %timeit np.einsum('ijkl,ij->kl', A, X)
   ....:     %timeit np.tensordot(A, X, [[0, 1], [0, 1]])
   ....:
1
1000000 loops, best of 3: 1.55 µs per loop
100000 loops, best of 3: 8.36 µs per loop
...
11
100000 loops, best of 3: 15.9 µs per loop
100000 loops, best of 3: 17.2 µs per loop
12
10000 loops, best of 3: 23.6 µs per loop
100000 loops, best of 3: 18.9 µs per loop
...
21
10000 loops, best of 3: 153 µs per loop
10000 loops, best of 3: 44.4 µs per loop
and the advantage of tensordot for larger arrays becomes clear, presumably because tensordot reduces the contraction to a single matrix product dispatched to the multithreaded BLAS, while einsum runs its own single-threaded loop.