NumPy/SciPy中的多线程整数矩阵乘法

éta*_*ogy 22 python multithreading numpy blas matrix-multiplication

做点什么

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)
Run Code Online (Sandbox Code Playgroud)

使用多个核心,运行良好.

a但是,中的元素是64位浮点数(或32位平台中的32位?),我想将8位整数数组相乘.但尝试以下方法:

a = np.random.randint(2, size=(n, n)).astype(np.int8)
Run Code Online (Sandbox Code Playgroud)

导致dot产品不使用多个内核,因此在我的PC上运行速度慢约1000倍.

array: np.random.randint(2, size=shape).astype(dtype)

dtype    shape          %time (average)

float32 (2000, 2000)    62.5 ms
float32 (3000, 3000)    219 ms
float32 (4000, 4000)    328 ms
float32 (10000, 10000)  4.09 s

int8    (2000, 2000)    13 seconds
int8    (3000, 3000)    3min 26s
int8    (4000, 4000)    12min 20s
int8    (10000, 10000)  It didn't finish in 6 hours

float16 (2000, 2000)    2min 25s
float16 (3000, 3000)    Not tested
float16 (4000, 4000)    Not tested
float16 (10000, 10000)  Not tested
Run Code Online (Sandbox Code Playgroud)

我知道NumPy使用BLAS,它不支持整数,但如果我使用SciPy BLAS包装,即.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)
Run Code Online (Sandbox Code Playgroud)

计算多线程的.现在,blas.sgemm运行的时间与np.dotfloat32的时间完全相同,但是对于非浮点数,它会将所有内容转换为float32输出浮点数,这是不可行的np.dot.(此外,b现在是F_CONTIGUOUS有序的,这是一个较小的问题).

所以,如果我想进行整数矩阵乘法,我必须做以下其中一项:

  1. 使用NumPy的速度np.dot很慢,很高兴能保持8位整数.
  2. 使用SciPy sgemm并使用4倍内存.
  3. 使用Numpy np.float16并且只使用2x内存,np.dot浮点数在float16数组上比在float32数组上慢得多,比int8更多.
  4. 找到一个优化的多线程整数矩阵乘法库(实际上,Mathematica这样做,但我更喜欢Python解决方案),理想情况下支持1位数组,虽然8位数组也很好...(我是实际上我的目标是在有限域Z/2Z上进行矩阵的乘法运算,我知道我可以用Sage做到这一点,Sage是相当Pythonic的,但是,再次,是否有严格的Python?)

我可以按照选项4吗?这样的图书馆存在吗?

免责声明:我实际上是在运行NumPy + MKL,但我在vanilly NumPy上尝试了类似的测试,结果相似.

kaz*_*ase 6

请注意,虽然这个答案变旧了 numpy 可能会获得优化的整数支持。请验证此答案在您的设置中是否仍能更快地工作。

  • 选项 5 - 推出自定义解决方案:将矩阵产品划分为几个子产品并并行执行。使用标准 Python 模块可以相对容易地实现这一点。子产品使用 计算numpy.dot,这会释放全局解释器锁。因此,可以使用相对轻量级的线程,并且可以从主线程访问数组以提高内存效率。

执行:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    #np.dot(a, b, out)  # does not work. maybe because out is not C-contiguous?
    out[:] = np.dot(a, b)  # less efficient because the output is stored in a temporary array?


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func, 
                                  args=(a_blocks[i, 0, :, :], 
                                        b_blocks[0, j, :, :], 
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
    
Run Code Online (Sandbox Code Playgroud)

通过这个实现,我得到了大约 x4 的加速,这是我机器中的物理内核数:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
Run Code Online (Sandbox Code Playgroud)