Allocating intermediate multidimensional arrays in Cython without acquiring the GIL

ali*_*i_m 10 python parallel-processing numpy cython thread-local-storage

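I'm trying to use Cython to parallelize an expensive operation that involves generating intermediate multidimensional arrays.

The following very simplified code illustrates the sort of thing I'm trying to do: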

import numpy as np
cimport cython
cimport numpy as np
from cython.parallel cimport prange
from libc.stdlib cimport malloc, free


@cython.boundscheck(False)
@cython.wraparound(False)
def embarrasingly_parallel_example(char[:, :] A):

    cdef unsigned int m = A.shape[0]
    cdef unsigned int n = A.shape[1]
    cdef np.ndarray[np.float64_t, ndim = 2] out = np.empty((m, m), np.float64)
    cdef unsigned int ii, jj
    cdef double[:, :] tmp

    for ii in prange(m, nogil=True):
        for jj in range(m):

            # allocate a temporary array to hold the result of
            # expensive_function_1
            tmp_carray = <double * > malloc((n ** 2) * sizeof(double))

            # a 2D typed memoryview onto tmp_carray
            tmp = <double[:n, :n] > tmp_carray

            # shove the intermediate result in tmp
            expensive_function_1(A[ii, :], A[jj, :], tmp)

            # get the final (scalar) output for this ii, jj
            out[ii, jj] = expensive_function_2(tmp)

            # free the intermediate array
            free(tmp_carray)

    return out


# some silly examples - the actual operation I'm performing is a lot more
# involved
# ------------------------------------------------------------------------
@cython.boundscheck(False)
@cython.wraparound(False)
cdef void expensive_function_1(char[:] x, char[:] y, double[:, :] tmp):

    cdef unsigned int m = tmp.shape[0]
    cdef unsigned int n = x.shape[0]
    cdef unsigned int ii, jj

    for ii in range(m):
        for jj in range(m):
            tmp[ii, jj] = 0
            for kk in range(n):
                tmp[ii, jj] += (x[kk] + y[kk]) * (ii - jj)


@cython.boundscheck(False)
@cython.wraparound(False)
cdef double expensive_function_2(double[:, :] tmp):

    cdef unsigned int m = tmp.shape[0]
    cdef unsigned int ii, jj
    cdef double result = 0

    for ii in range(m):
        for jj in range(m):
            result += tmp[ii, jj]

    return result

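There seem to be at least two reasons why this fails to compile:

  1. Based on the output of cython -a, creating the typed memoryview here:

    cdef double[:, :] tmp = <double[:n, :n] > tmp_carray

    seems to involve Python API calls, so I can't release the GIL to let the outer loop run in parallel.

    I was under the impression that typed memoryviews are not Python objects, and that a worker thread should therefore be able to create them without first acquiring the GIL. Is that the case?

  2. Even if I replace prange(m, nogil=True) with a plain range(m), Cython still doesn't seem to like the presence of a cdef inside the inner loop: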

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
                # allocate a temporary array to hold the result of
                # expensive_function_1
                tmp_carray = <double*> malloc((n ** 2) * sizeof(double))

                # a 2D typed memoryview onto tmp_carray
                cdef double[:, :] tmp = <double[:n, :n]> tmp_carray
                    ^
    ------------------------------------------------------------

    parallel_allocate.pyx:26:17: cdef statement not allowed here

Update

It turns out that the second problem is easily solved by moving

 cdef double[:, :] tmp

outside of the for loop, and just assigning

 tmp = <double[:n, :n] > tmp_carray

within the loop. I still don't fully understand why this is necessary, though.

Now if I try to use prange, I hit the following compilation error:

Error compiling Cython file:
------------------------------------------------------------
...
            # allocate a temporary array to hold the result of
            # expensive_function_1
            tmp_carray = <double*> malloc((n ** 2) * sizeof(double))

            # a 2D typed memoryview onto tmp_carray
            tmp = <double[:n, :n]> tmp_carray
               ^
------------------------------------------------------------

parallel_allocate.pyx:28:16: Memoryview slices can only be shared in parallel sections

hiv*_*ert 6

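Disclaimer: take everything here with a grain of salt. I'm guessing more than I know. You should certainly ask on Cython-User; they are always friendly and quick to answer.

I agree that Cython's documentation is not very clear on this:

[...] memoryviews often do not need the GIL:

cpdef int sum3d(int[:, :, :] arr) nogil: ...

In particular, you do not need the GIL for memoryview indexing, slicing or transposing. Memoryviews require the GIL for the copy methods (C and Fortran contiguous copies), or when the dtype is object and an object element is read or written.

I think this means that passing a memoryview as a parameter, or slicing or transposing it, doesn't need the Python GIL. However, creating a memoryview or copying one does need the GIL.

Another argument supporting this is that a Cython function can return a memoryview to Python.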

from cython.view cimport array as cvarray
import numpy as np

def bla():
    narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
    cdef int [:, :, :] narr_view = narr
    return narr_view

which gives:

>>> import hello
>>> hello.bla()
<MemoryView of 'ndarray' at 0x1b03380>

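So the memoryview is allocated in Python's GC-managed memory, and creating it therefore needs the GIL. That means you can't create a memoryview in a nogil section.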


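Now, regarding the error message

Memoryview slices can only be shared in parallel sections

I think you should read it as "You can't have a thread-private memoryview slice. It must be a thread-shared memoryview slice."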
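
If that reading is right, one way around both errors might be to skip the memoryview for the scratch space and give each thread a raw pointer allocated inside a parallel() block, roughly the thread-local buffer pattern shown in the Cython parallelism docs. What follows is only a sketch under assumptions, and I have not compiled it against the real code. The helper names expensive_function_1_ptr and expensive_function_2_ptr and the name embarrassingly_parallel_sketch are made up for illustration. A is assumed to be C-contiguous (char[:, ::1]) with n > 0, so that &A[ii, 0] points at a full row. The module is assumed to be built with OpenMP enabled.

```cython
import numpy as np
cimport cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport malloc, free, abort


# Hypothetical pointer-based stand-in for expensive_function_1:
# tmp is a flat n*n buffer, indexed row-major as tmp[i * n + j].
cdef void expensive_function_1_ptr(char* x, char* y,
                                   double* tmp, Py_ssize_t n) nogil:
    cdef Py_ssize_t i, j, k
    for i in range(n):
        for j in range(n):
            tmp[i * n + j] = 0
            for k in range(n):
                tmp[i * n + j] += (x[k] + y[k]) * (i - j)


# Hypothetical pointer-based stand-in for expensive_function_2.
cdef double expensive_function_2_ptr(double* tmp, Py_ssize_t n) nogil:
    cdef Py_ssize_t i
    cdef double result = 0
    for i in range(n * n):
        result += tmp[i]
    return result


@cython.boundscheck(False)
@cython.wraparound(False)
def embarrassingly_parallel_sketch(char[:, ::1] A):
    cdef Py_ssize_t m = A.shape[0]
    cdef Py_ssize_t n = A.shape[1]   # assumed > 0
    cdef Py_ssize_t ii, jj
    cdef double* tmp
    cdef double[:, ::1] out_view

    # The output array and its memoryview are created while holding the GIL.
    out = np.empty((m, m), dtype=np.float64)
    out_view = out

    with nogil, parallel():
        # tmp is assigned inside the parallel block, so each thread
        # gets its own private pointer and its own scratch buffer.
        tmp = <double*> malloc(n * n * sizeof(double))
        if tmp is NULL:
            abort()

        for ii in prange(m):
            for jj in range(m):
                expensive_function_1_ptr(&A[ii, 0], &A[jj, 0], tmp, n)
                out_view[ii, jj] = expensive_function_2_ptr(tmp, n)

        # Each thread frees its own buffer once the loop is done.
        free(tmp)

    return out
```

The only memoryviews left are A and out_view. Both are created before the GIL is released and are only indexed inside the loop, which the documentation quoted above says is allowed. On Cython 3 it is probably also worth marking the two helpers noexcept, so that calling them does not need the GIL to check for exceptions.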