Tags: python, multithreading, numpy, blas, matrix-multiplication
Doing the following:
import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)
uses multiple cores, and it runs well.
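Whether dot runs multithreaded depends on the BLAS library this NumPy build is linked against; one way to check, using NumPy's standard introspection helper:

import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries NumPy was built against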
However, the elements of a are 64-bit floats (or 32-bit on 32-bit platforms?), and I would like to multiply 8-bit integer arrays. Trying the following, though:
a = np.random.randint(2, size=(n, n)).astype(np.int8)
results in the dot product not using multiple cores, so it runs about 1000x slower on my PC.
array: np.random.randint(2, size=shape).astype(dtype)

dtype     shape            %time (average)
float32   (2000, 2000)     62.5 ms
float32   (3000, 3000)     219 ms
float32   (4000, 4000)     328 ms
float32   (10000, 10000)   4.09 s
int8      (2000, 2000)     13 s
int8      (3000, 3000)     3 min 26 s
int8      (4000, 4000)     12 min 20 s
int8      (10000, 10000)   did not finish within 6 hours
float16   (2000, 2000)     2 min 25 s
float16   (3000, 3000)     not tested
float16   (4000, 4000)     not tested
float16   (10000, 10000)   not tested
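The %time figures above come from IPython; a minimal standalone sketch that reproduces the float32-vs-int8 comparison (size taken from the first table rows):

import numpy as np
from time import perf_counter

n = 2000
for dtype in (np.float32, np.int8):
    a = np.random.randint(2, size=(n, n)).astype(dtype)
    start = perf_counter()
    np.dot(a, a)
    print('{}: {:.3f} s'.format(np.dtype(dtype).name, perf_counter() - start))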
I understand that NumPy uses BLAS, which doesn't support integer types, but if I use the SciPy BLAS wrappers, i.e.
import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)
the computation is multi-threaded. Now, blas.sgemm runs with exactly the same timing as np.dot for float32, but for non-floats it converts everything to float32 and outputs floats, which np.dot does not do. (In addition, b is now in F_CONTIGUOUS order, which is a lesser issue.)
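To make the upcasting and memory layout concrete, a small sketch (the cast back to int32 at the end is an addition for illustration; it is exact here because every dot product of 0/1 vectors of length 2000 stays well below 2**24, the largest range of integers float32 represents exactly):

import numpy as np
import scipy.linalg.blas as blas

n = 2000
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)       # inputs are silently upcast to float32 (4x memory)
print(b.dtype, b.flags['F_CONTIGUOUS'])   # -> float32 True
b_int = np.ascontiguousarray(b).astype(np.int32)  # back to a C-ordered integer result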
So, if I want to do integer matrix multiplication, I have to do one of the following:

1. Use NumPy's painfully slow np.dot and be glad I get to keep the 8-bit integers.
2. Use SciPy's sgemm and use up 4x the memory (see the memory sketch after this list).
3. Use np.float16 and only use up 2x the memory, with the caveat that np.dot is much slower on float16 arrays than on float32 arrays, more so than int8.
4. Find an optimized library for multi-threaded integer matrix multiplication.

Can I go with option 4? Does such a library exist?
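The memory factors in options 2 and 3 follow directly from the element sizes; a quick check, computing the footprint of hypothetical 10**4-square arrays without allocating them:

import numpy as np

n = 10**4
for dtype in (np.int8, np.float16, np.float32):
    gb = np.dtype(dtype).itemsize * n * n / 1e9
    print('{}: {:.1f} GB'.format(np.dtype(dtype).name, gb))  # 0.1, 0.2, 0.4 GB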
Disclaimer: I'm actually running NumPy + MKL, but I have tried a similar test with vanilla NumPy, getting similar results.
Note that while this answer is getting old, NumPy might gain optimized integer support in the meantime. Please verify that this approach is still faster on your setup.
The workaround below relies on numpy.dot releasing the Global Interpreter Lock: the product can therefore be computed in blocks by relatively lightweight threads, which can access the arrays from the main thread for memory efficiency. Implementation:
import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time
def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)
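# For example (illustrative shapes): for a 4x6 array x, blockshaped(x, 2, 3)
# has shape (2, 3, 2, 2), and block [i, j] equals x[i*2:(i+1)*2, j*2:(j+1)*2].
# reshape and swapaxes both return views here, so writing into
# blockshaped(out, ...) writes straight into out itself.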
def do_dot(a, b, out):
    # np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?
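    # Indeed: np.dot's out= argument requires a C-contiguous output whose dtype
    # matches what dot(a, b) would return; the block views from blockshaped are
    # not contiguous, hence the assignment through a temporary array above.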
def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))
    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)
    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)
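    # a is split into nblocks horizontal bands and b into mblocks vertical
    # bands; output block (i, j) is the dot product of band i with band j.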
    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func,
                                  args=(a_blocks[i, 0, :, :],
                                        b_blocks[0, j, :, :],
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)
    for th in threads:
        th.join()
    return out
if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
With this implementation I get a speedup of about x4, which matches the number of physical cores on my machine:
running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
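A hypothetical application of pardot to the question's int8 arrays (one caveat, kept as a comment below: np.dot on two int8 inputs also returns int8, so entries above 127 wrap around; cast to a wider integer dtype first if the dot products can exceed that range):

import numpy as np  # pardot and blockshaped as defined in the listing above

n = 4000
a = np.random.randint(2, size=(n, n)).astype(np.int8)
# the output dtype follows a.dtype (int8), matching np.dot's behavior, so
# beware of wraparound when row/column dot products can exceed 127
b = pardot(a, a, 2, 4)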