Ope*_*son 4 matlab vectorization multidimensional-array matrix-multiplication
我主要在MATLAB工作,但我认为答案不应该太难以从一种语言延伸到另一种语言.
我有一个多维阵列X,其尺寸[n, p, 3].我想计算下面的多维数组.
T = zeros(p, p, p)
for i = 1:p
for j = 1:p
for k = 1:p
T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
end
end
end
Run Code Online (Sandbox Code Playgroud)
总和是长度n矢量的元素.任何帮助表示赞赏!
您只需要对尺寸进行一些置换,并使用单例扩展进行乘法运算:
T = sum(bsxfun(@times, bsxfun(@times, permute(X(:,:,1), [2 4 5 3 1]), permute(X(:,:,2), [4 2 5 3 1])), permute(X(:,:,3), [4 5 2 3 1])), 5);
Run Code Online (Sandbox Code Playgroud)
从R2016b开始,这可以更容易地编写
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);
Run Code Online (Sandbox Code Playgroud)
正如我在评论中提到的,矢量化并不总是一个巨大的优势.因此,有一些矢量化方法可以减慢代码速度而不是加速代码.您必须始终为解决方案计时.矢量化通常涉及创建大型临时数组,或者在循环代码中避免的大量数据的副本.这取决于架构,输入的大小以及许多其他因素,如果这样的解决方案会更快.
尽管如此,在这种情况下,似乎矢量化方法可以产生大的加速.
关于原始代码的首要注意事项是X(:, i, 1) .* X(:, j, 2)在内部循环中重新计算,尽管它是一个常量值.重写内循环,这样可以节省时间:
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
T(i, j, k) = sum(Y .* X(:, k, 3));
end
Run Code Online (Sandbox Code Playgroud)
现在我们注意到内部循环是一个点积,可以写成如下:
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
Run Code Online (Sandbox Code Playgroud)
该.'转就Y不会复制数据,Y是一个载体.接下来,我们注意到X(:, :, 3)重复索引.让我们将其移出外循环.现在我留下以下代码:
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
for j = 1:p
Y = X1(:, i) .* X2(:, j);
T(i, j, :) = Y.' * X3;
end
end
Run Code Online (Sandbox Code Playgroud)
删除循环可能j同样容易,这将留下一个循环i.但这是我停下来的地方.
这是我看到的时间(R2017a,3岁的iMac,有4个核心).用于n=10, p=20:
original: 0.0206
moving Y out the inner loop: 0.0100
removing inner loop: 0.0016
moving indexing out of loops: 7.6294e-04
Luis' answer: 1.9196e-04
Run Code Online (Sandbox Code Playgroud)
对于更大的阵列n=50, p=100:
original: 2.9107
moving Y out the inner loop: 1.3488
removing inner loop: 0.0910
moving indexing out of loops: 0.0361
Luis' answer: 0.1417
Run Code Online (Sandbox Code Playgroud)
"路易斯的答案"就是这个.它对于小型阵列而言是最快的,但对于较大的阵列,它显示了置换的成本.将第一个产品的计算移出内部循环可以节省一半以上的计算成本.但是去掉内环会大大降低成本(我没想到,我认为单矩阵产品可以比许多小元素产品更好地使用并行性).然后,我们通过减少循环内的索引操作量来进一步减少时间.
这是时间码:
function so()
n = 10; p = 20;
%n = 50; p = 100;
X = randn(n,p,3);
T1 = method1(X);
T2 = method2(X);
T3 = method3(X);
T4 = method4(X);
T5 = method5(X);
assert(max(abs(T1(:)-T2(:)))<1e-13)
assert(max(abs(T1(:)-T3(:)))<1e-13)
assert(max(abs(T1(:)-T4(:)))<1e-13)
assert(max(abs(T1(:)-T5(:)))<1e-13)
timeit(@()method1(X))
timeit(@()method2(X))
timeit(@()method3(X))
timeit(@()method4(X))
timeit(@()method5(X))
function T = method1(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
for k = 1:p
T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
end
end
end
function T = method2(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
T(i, j, k) = sum(Y .* X(:, k, 3));
end
end
end
function T = method3(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
end
end
function T = method4(X)
p = size(X,2);
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
for j = 1:p
Y = X1(:, i) .* X2(:, j);
T(i, j, :) = Y.' * X3;
end
end
function T = method5(X)
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);
Run Code Online (Sandbox Code Playgroud)