The algorithm I'm working on requires computing a kind of matrix triple product in a couple of places.
The operation takes three square matrices of identical dimensions and produces a 3-index tensor. Denoting the operands A, B, C and the result X, the (i,j,k)-th element is
X[i,j,k] = \sum_a A[i,a] B[a,j] C[k,a]
In numpy, you can compute this with einsum('ia,aj,ka->ijk', A, B, C).
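To pin down the index convention, here is a minimal loop-based cross-check in numpy (a sketch; the helper name triple_product_loops is mine, for illustration only):

import numpy as np

def triple_product_loops(A, B, C):  # illustration helper, not from the question
    # Literal transcription of X[i,j,k] = \sum_a A[i,a] B[a,j] C[k,a]
    m, d = A.shape
    n = B.shape[1]
    p = C.shape[0]
    X = np.empty((m, n, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                X[i, j, k] = np.sum(A[i, :] * B[:, j] * C[k, :])
    return X

A = np.random.rand(3, 3)
B = np.random.rand(3, 3)
C = np.random.rand(3, 3)
assert np.allclose(triple_product_loops(A, B, C),
                   np.einsum('ia,aj,ka->ijk', A, B, C))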
Question:
Let n x n be the matrix size. In Matlab, you can:

1. Combine A and C into an n^2 x n matrix AC, such that the rows of AC correspond to all combinations of the rows of A and C.
2. Multiply AC by B. This gives the desired result, only in a different shape.
3. Reshape and permute into the expected n x n x n shape.

Code:
AC = reshape(bsxfun(@times, permute(A, [1 3 2]), permute(C, [3 1 2])), n^2, n); % // 1
X = permute(reshape((AC*B).', n, n, n), [2 1 3]); %'// 2, 3
A check against a literal loop-based approach:
%// Example data:
n = 3;
A = rand(n,n);
B = rand(n,n);
C = rand(n,n);
%// Proposed approach:
AC = reshape(bsxfun(@times, permute(A, [1 3 2]), permute(C, [3 1 2])), n^2, n);
X = permute(reshape((AC*B).', n, n, n), [2 1 3]); %'
%// Loop-based approach:
Xloop = NaN(n,n,n); %// initialize
for ii = 1:n
for jj = 1:n
for kk = 1:n
Xloop(ii,jj,kk) = sum(A(ii,:).*B(:,jj).'.*C(kk,:)); %'
end
end
end
%// Compute maximum relative difference:
max(max(max(abs(X./Xloop-1))))
ans =
2.2204e-16
The maximum relative difference is of order eps, so the result is correct to within numerical precision.
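For reference, the same row-combination trick translates almost directly into numpy broadcasting; a minimal sketch, assuming square n x n inputs as in the Matlab example (it is essentially the factorization exploited in the answer below):

import numpy as np

n = 3
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C = np.random.rand(n, n)

# Rows of AC hold the element-wise products of every (row of A, row of C)
# pair, mirroring the bsxfun/permute construction above (row-major here).
AC = (A[:, None, :] * C[None, :, :]).reshape(n*n, n)

# One matrix product sums over a; reshape/transpose restores (i,j,k) order.
X = (AC @ B).reshape(n, n, n).transpose(0, 2, 1)

assert np.allclose(X, np.einsum('ia,aj,ka->ijk', A, B, C))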
np.einsum is really hard to beat, but in rare cases you can still beat it if you can bring matrix-multiplication into the computation. After a few trials, it seems you can bring in matrix-multiplication with np.dot to surpass the performance of np.einsum('ia,aj,ka->ijk', A, B, C).
The basic idea is to break the "all-einsum" operation into a combination of np.einsum and np.dot as follows:

1. The summation over a between A:[i,a] and B:[a,j] is done with np.einsum, giving a 3D array [i,j,a].
2. This 3D array is reshaped into a 2D array [i*j,a], and the third array C:[k,a] is transposed to [a,k], so that matrix-multiplication between these two yields [i*j,k] as the matrix product, since the index [a] is summed away there.
3. The product is reshaped into a 3D array [i,j,k] for the final output.

Here's the implementation of this first version -
import numpy as np
def tensor_prod_v1(A,B,C): # First version of proposed method
# Shape parameters
m,d = A.shape
n = B.shape[1]
p = C.shape[0]
# Calculate \sum_a A[i,a] B[a,j] to get a 3D array with indices as (i,j,a)
AB = np.einsum('ia,aj->ija', A, B)
# Calculate entire summation losing the a-th index & reshaping to desired shape
return np.dot(AB.reshape(m*n,d),C.T).reshape(m,n,p)
Since we sum over the a-th index across all three input arrays, there are three different ways to pair up arrays when summing along the a-th index. The code listed earlier paired (A,B). We can thus also pair (A,C) and (B,C), giving us two more variations, listed next:
def tensor_prod_v2(A,B,C):
# Shape parameters
m,d = A.shape
n = B.shape[1]
p = C.shape[0]
# Calculate \sum_a A[i,a] C[k,a] to get a 3D array with indices as (i,k,a)
AC = np.einsum('ia,ja->ija', A, C)
# Calculate entire summation losing the a-th index & reshaping to desired shape
return np.dot(AC.reshape(m*p,d),B).reshape(m,p,n).transpose(0,2,1)
def tensor_prod_v3(A,B,C):
# Shape parameters
m,d = A.shape
n = B.shape[1]
p = C.shape[0]
# Calculate \sum_a B[a,j] C[k,a] to get a 3D array with indices as (a,j,k)
BC = np.einsum('ai,ja->aij', B, C)
# Calculate entire summation losing the a-th index & reshaping to desired shape
return np.dot(A,BC.reshape(d,n*p)).reshape(m,n,p)
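Before timing anything, a quick sanity check that all three variants agree with the all-einsum result (a minimal sketch; the dimensions are deliberately unequal so any shape mix-up would fail loudly):

A = np.random.rand(4, 6)   # [i,a]
B = np.random.rand(6, 5)   # [a,j]
C = np.random.rand(7, 6)   # [k,a]

X_ref = np.einsum('ia,aj,ka->ijk', A, B, C)
for f in (tensor_prod_v1, tensor_prod_v2, tensor_prod_v3):
    assert np.allclose(f(A, B, C), X_ref)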
Depending on the shapes of the input arrays, the different approaches yield different speedups relative to one another, but we can hope that all of them beat the all-einsum approach. The performance numbers are listed in the next section.
This is probably the most important section, as we look into the speedup numbers for the three variations of the proposed approach over the all-einsum approach originally listed in the question.
Dataset #1 (equal-shaped arrays):
In [494]: L1 = 200
...: L2 = 200
...: L3 = 200
...: al = 200
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [495]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 470 ms per loop
1 loops, best of 3: 391 ms per loop
1 loops, best of 3: 446 ms per loop
1 loops, best of 3: 3.59 s per loop
Dataset #2 (bigger A):
In [497]: L1 = 1000
...: L2 = 100
...: L3 = 100
...: al = 100
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [498]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 442 ms per loop
1 loops, best of 3: 355 ms per loop
1 loops, best of 3: 303 ms per loop
1 loops, best of 3: 2.42 s per loop
Dataset #3 (bigger B):
In [500]: L1 = 100
...: L2 = 1000
...: L3 = 100
...: al = 100
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [501]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 474 ms per loop
1 loops, best of 3: 247 ms per loop
1 loops, best of 3: 439 ms per loop
1 loops, best of 3: 2.26 s per loop
Dataset #4 (bigger C):
In [503]: L1 = 100
...: L2 = 100
...: L3 = 1000
...: al = 100
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
In [504]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 250 ms per loop
1 loops, best of 3: 358 ms per loop
1 loops, best of 3: 362 ms per loop
1 loops, best of 3: 2.46 s per loop
Dataset #5 (bigger length along the a dimension):
In [506]: L1 = 100
...: L2 = 100
...: L3 = 100
...: al = 1000
...:
...: A = np.random.rand(L1,al)
...: B = np.random.rand(al,L2)
...: C = np.random.rand(L3,al)
...:
In [507]: %timeit tensor_prod_v1(A,B,C)
...: %timeit tensor_prod_v2(A,B,C)
...: %timeit tensor_prod_v3(A,B,C)
...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
...:
1 loops, best of 3: 373 ms per loop
1 loops, best of 3: 269 ms per loop
1 loops, best of 3: 299 ms per loop
1 loops, best of 3: 2.38 s per loop
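The numbers hint at a pattern: the winning variant tends to be the one whose np.einsum stage avoids the largest operand, leaving it to the BLAS-backed np.dot. A hypothetical helper encoding that observation (tensor_prod_auto and the heuristic itself are mine, inferred only from the four benchmarks above; validate on your own shapes):

def tensor_prod_auto(A, B, C):
    # Heuristic guess: v1 einsums (A,B), v2 einsums (A,C), v3 einsums (B,C),
    # so route the operand with the most elements to the np.dot stage.
    sizes = {'A': A.size, 'B': B.size, 'C': C.size}
    biggest = max(sizes, key=sizes.get)
    if biggest == 'C':
        return tensor_prod_v1(A, B, C)
    if biggest == 'B':
        return tensor_prod_v2(A, B, C)
    return tensor_prod_v3(A, B, C)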
Conclusion: We are seeing speedups of 8x-10x with the variations of the proposed approach over the all-einsum approach listed in the question.
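As a closing aside, newer NumPy releases can find such BLAS-friendly factorizations automatically: np.einsum accepts an optimize flag (available since NumPy 1.12) that searches for a cheaper contraction path, which may recover much of this gap without hand-written variants; worth timing on your own setup:

X_opt = np.einsum('ia,aj,ka->ijk', A, B, C, optimize=True)
assert np.allclose(X_opt, np.einsum('ia,aj,ka->ijk', A, B, C))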