I'm getting some efficiency results I can't explain.
I want to assemble a matrix B whose i-th entries are B[i,:,:] = A[i,:,:].dot(x), where each A[i,:,:] is a 2D matrix, and so is x.
I tested the performance of the three approaches below, using random (numpy.random.randn) matrices with A of shape (10, 1000, 1000) and x of shape (1000, 1200). I get the following timing results:
(1) A single multi-dimensional dot product
B = A.dot(x)
total time: 102.361 s
(2) Looping over i and doing 2D dot products
# initialize B = np.zeros([dim1, dim2, dim3])
for i in range(A.shape[0]):
    B[i,:,:] = A[i,:,:].dot(x)
total time: 0.826 s
(3) numpy.einsum
B3 = np.einsum("ijk, kl -> ijl", A, x)
total time: 8.289 s
So option (2) is by far the fastest. But considering only (1) and (2), I don't see how there could be such a huge difference between them. How can looping and doing 2D dot products be ~124x faster? Both use numpy.dot. Any insight?
I've included the code used to produce the results above:
import numpy as np
import numpy.random as npr
import time

dim1, dim2, dim3 = 10, 1000, 1200
A = npr.randn(dim1, dim2, dim2)
x = npr.randn(dim2, dim3)

# consider three ways of assembling the same matrix B: B1, B2, B3
t = time.time()
B1 = np.dot(A, x)
td1 = time.time() - t
print("a single dot product of A [shape = (%d, %d, %d)] with x [shape = (%d, %d)] completes in %.3f s"
      % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td1))

B2 = np.zeros([A.shape[0], x.shape[0], x.shape[1]])
t = time.time()
for i in range(A.shape[0]):
    B2[i, :, :] = np.dot(A[i, :, :], x)
td2 = time.time() - t
print("taking %d 2D dot products of A[i,:,:] [shape = (%d, %d)] with x [shape = (%d, %d)] completes in %.3f s"
      % (A.shape[0], A.shape[1], A.shape[2], x.shape[0], x.shape[1], td2))

t = time.time()
B3 = np.einsum("ijk, kl -> ijl", A, x)
td3 = time.time() - t
print("using np.einsum, it completes in %.3f s" % td3)
numpy.dot only delegates to a BLAS matrix-matrix product when each input has at most 2 dimensions:
#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif
When you feed the whole 3-dimensional A array into dot, NumPy takes a slower path that goes through an nditer object. It still tries to use BLAS on that slow path, but the way the slow path is structured it can only issue vector-vector multiplications rather than matrix-matrix multiplications, which gives BLAS far less room to optimize.
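As a side note, here is a minimal sketch of one way to avoid that slow path, assuming a NumPy build recent enough to provide np.matmul: matmul treats the 3-D A as a stack of 2-D matrices, multiplies each slice with x, and so keeps the per-slice work on 2-D matrix products instead of the nditer path.

import numpy as np

dim1, dim2, dim3 = 10, 1000, 1200
A = np.random.randn(dim1, dim2, dim2)
x = np.random.randn(dim2, dim3)

# np.dot with a 3-D first argument takes the slower nditer path described above.
B_dot = np.dot(A, x)

# np.matmul (the @ operator) broadcasts x against the stack of 2-D matrices in A
# and multiplies slice by slice, so each product stays a 2-D matrix-matrix call
# (behaviour assumed for reasonably recent NumPy builds).
B_matmul = A @ x

print(np.allclose(B_dot, B_matmul))  # True: same values, different code path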
Using smaller dims of 10, 100, 200, I get a similar ranking:
In [355]: %%timeit
   .....: B=np.zeros((N,M,L))
   .....: for i in range(N):
   .....:     B[i,:,:]=np.dot(A[i,:,:],x)
   .....:
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop
In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop
In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop
Iteration is faster, though not by nearly as much as in your case.
That probably holds as long as the iterated dimension is small compared to the other ones. In that case the overhead of iteration (function calls, etc.) is small relative to the calculation time. Doing all the values at once also uses more memory.
I tried a dot variant that reshapes A to 2D, thinking dot might do that kind of reshaping internally. I was surprised that it is actually the fastest. tensordot is probably doing the same reshaping (its code is readable Python).
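To make the reshape trick concrete, here is a minimal sketch (shapes taken from the question; the names B_reshaped and B_loop are mine) that folds the stacked axis into dot's rows and checks the result against the per-slice loop:

import numpy as np

N, M, L = 10, 1000, 1200           # dim1, dim2, dim3 from the question
A = np.random.randn(N, M, M)
x = np.random.randn(M, L)

# Fold the leading axis into the rows so dot sees one big 2-D product,
# then restore the 3-D shape afterwards.
B_reshaped = np.dot(A.reshape(-1, M), x).reshape(N, M, L)

# Reference: the per-slice loop from the question.
B_loop = np.empty((N, M, L))
for i in range(N):
    B_loop[i, :, :] = np.dot(A[i, :, :], x)

print(np.allclose(B_reshaped, B_loop))  # True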
einsum sets up a "sum of products" iteration involving 4 variables, i, j, k, m, i.e. dim1*dim2*dim2*dim3 steps at the C-level nditer (with the question's original sizes, that is 10*1000*1000*1200 ≈ 1.2e10 index combinations). So the more indices, the larger the iteration space.
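One further note that goes beyond the timings above: newer NumPy releases accept an optimize argument to einsum, which lets it reorganize the contraction (typically dispatching the heavy lifting to BLAS) instead of walking that full index space. A minimal sketch, assuming such a version is installed:

import numpy as np

A = np.random.randn(10, 1000, 1000)
x = np.random.randn(1000, 1200)

# Plain einsum walks the full i,j,k,l index space with nditer.
B_plain = np.einsum("ijk, kl -> ijl", A, x)

# optimize=True (available in newer NumPy versions) lets einsum reorganize the
# contraction, typically routing the work through BLAS.
B_opt = np.einsum("ijk, kl -> ijl", A, x, optimize=True)

print(np.allclose(B_plain, B_opt))  # True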