Ped*_*raz 0 python performance numpy cython
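I am trying to optimize some code that performs a large number of sequential matrix operations.
I assumed numpy.linalg.multi_dot (see the documentation) would perform all of its operations in C or BLAS, and would therefore be faster than something like arr1.dot(arr2).dot(arr3) and so on.
I was really surprised when I ran this code in a notebook: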
import numpy as np
v1 = np.random.rand(2,2)
v2 = np.random.rand(2,2)
%%timeit
v1.dot(v2.dot(v1.dot(v2)))
The slowest run took 9.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.14 µs per loop
%%timeit
np.linalg.multi_dot([v1,v2,v1,v2])
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.9 µs per loop
only to find that the same operation is about 10 times slower with multi_dot.
My question is:
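This is because your test matrices are too small and too regular; the overhead of determining the fastest evaluation order can outweigh the potential performance gain.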
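The overhead can be reproduced outside a notebook with the standard `timeit` module. This is a minimal sketch, not a rigorous benchmark: the helper names `chained` and `with_multi_dot` are made up for illustration, and the absolute timings depend on the machine and the NumPy build.

```python
import timeit

import numpy as np
from numpy.linalg import multi_dot

rng = np.random.default_rng(0)
v1 = rng.random((2, 2))
v2 = rng.random((2, 2))


def chained(a, b):
    # Explicit right-to-left chain, as in the question.
    return a.dot(b.dot(a.dot(b)))


def with_multi_dot(a, b):
    # Same product, but multi_dot first works out an evaluation order.
    return multi_dot([a, b, a, b])


# Both routes give the same matrix, up to floating-point rounding.
same_result = np.allclose(chained(v1, v2), with_multi_dot(v1, v2))

n = 2000  # repetitions per measurement
t_chained = timeit.timeit(lambda: chained(v1, v2), number=n) / n
t_multi = timeit.timeit(lambda: with_multi_dot(v1, v2), number=n) / n

print(f"same result: {same_result}")
print(f"chained dot: {t_chained * 1e6:.2f} µs per call")
print(f"multi_dot:   {t_multi * 1e6:.2f} µs per call")
```

For 2x2 inputs every evaluation order costs the same number of scalar multiplications, so whatever time multi_dot spends choosing an order cannot be recovered.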
Using the example from the documentation:
import numpy as np
from numpy.linalg import multi_dot
# Prepare some data
A = np.random.rand(10000, 100)
B = np.random.rand(100, 1000)
C = np.random.rand(1000, 5)
D = np.random.rand(5, 333)
%timeit -n 10 multi_dot([A, B, C, D])
%timeit -n 10 np.dot(np.dot(np.dot(A, B), C), D)
%timeit -n 10 A.dot(B).dot(C).dot(D)
Results:
10 loops, best of 3: 12 ms per loop
10 loops, best of 3: 62.7 ms per loop
10 loops, best of 3: 59 ms per loop
multi_dot improves performance by choosing the multiplication order that needs the fewest scalar multiplications.
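In the example above, the default left-to-right order ((AB)C)D is replaced by a cheaper one such as A((BC)D): the 10000x100 @ 100x1000 multiplication shrinks to 10000x100 @ 100x333, which cuts at least 2/3 of the scalar multiplications.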
You can verify this by timing the reordered product directly:
%timeit -n 10 np.dot(A, np.dot(np.dot(B, C), D))
10 loops, best of 3: 19.2 ms per loop
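The scalar-multiplication counts behind these timings can be checked by hand. The sketch below is only an illustration: the helper `cost` is hypothetical, it works on shapes rather than arrays, and it uses the usual m*k*n count for an (m x k) @ (k x n) product, which ignores memory traffic and BLAS effects.

```python
def cost(expr):
    """Return (scalar multiplications, result shape) for a nested pair of shapes."""
    if isinstance(expr[0], int):
        # A leaf: a plain (rows, cols) shape costs nothing by itself.
        return 0, expr
    left_cost, (m, k) = cost(expr[0])
    right_cost, (k2, n) = cost(expr[1])
    if k != k2:
        raise ValueError(f"shape mismatch: {(m, k)} @ {(k2, n)}")
    return left_cost + right_cost + m * k * n, (m, n)


# Shapes from the documentation example above.
A, B, C, D = (10000, 100), (100, 1000), (1000, 5), (5, 333)

orders = {
    "((AB)C)D": (((A, B), C), D),
    "A((BC)D)": (A, ((B, C), D)),
    "(A(BC))D": ((A, (B, C)), D),
}
costs = {name: cost(expr)[0] for name, expr in orders.items()}

for name, value in costs.items():
    print(f"{name}: {value:,} scalar multiplications")
```

By this count ((AB)C)D needs 1,066,650,000 multiplications and A((BC)D) needs 333,666,500, roughly 31% as many. (A(BC))D needs only 22,150,000, which would be consistent with multi_dot's 12 ms being lower than the 19.2 ms measured for A((BC)D).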