CUDA vectorization of a function with a complex input and a complex output fails in Numba

Tec*_*Guy 7 cuda cpython vectorization numba

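I have a program that plots the Mandelbrot set, and I used njit to make it run on CPU threads. Now I want to generate a 32k image, but even using all the threads it is too slow. So I tried to make the code run on the GPU. Here is the code: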

from numba import njit, cuda, vectorize
from PIL import Image, ImageDraw


@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n


def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):

    for x in range(0, WIDTH):

        for y in range(0, HEIGHT):

            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))


def vai():
    # Image size (pixels)
    WIDTH = 15360
    HEIGHT = 8640

    # Plot window
    RE_START = -2
    RE_END = 1
    IM_START = -1
    IM_END = 1

    palette = []

    im = Image.new('RGB', (WIDTH, HEIGHT), (0, 0, 0))
    draw = ImageDraw.Draw(im)
    vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw )

    im.save('output.png', 'PNG')

vai()

This is the error:

D:\anaconda\python.exe C:/Users/techguy/PycharmProjects/mandelbrot/main.py
0
Traceback (most recent call last):
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 56, in <module>
    vai()
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 52, in vai
    vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw )
  File "C:/Users/techguy/PycharmProjects/mandelbrot/main.py", line 30, in vari
    m = mandelbrot(c)
  File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 41, in __call__
    return CUDAUFuncMechanism.call(self.functions, args, kws)
  File "D:\anaconda\lib\site-packages\numba\np\ufunc\deviceufunc.py", line 301, in call
    cr.launch(func, shape[0], stream, devarys)
  File "D:\anaconda\lib\site-packages\numba\cuda\dispatcher.py", line 152, in launch
    func.forall(count, stream=stream)(*args)
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 372, in __call__
    kernel = self.kernel.specialize(*args)
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 881, in specialize
    specialization = Dispatcher(self.py_func, [types.void(*argtypes)],
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 808, in __init__
    self.compile(sigs[0])
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 935, in compile
    kernel.bind()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 576, in bind
    self._func.get()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 446, in get
    ptx = self.ptx.get()
  File "D:\anaconda\lib\site-packages\numba\cuda\compiler.py", line 414, in get
    arch = nvvm.get_arch_option(*cc)
  File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
    return 'compute_%d%d' % arch
TypeError: not enough arguments for format string

Process finished with exit code 1

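If I replace @vectorize with @njit(nogil=True), it works fine, but it runs on the CPU. I absolutely need it to run on the GPU. I think the problem has something to do with the complex type.
What is the problem?

The code is not mine: I found it in How to plot the Mandelbrot set.

I only modified a few pieces.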

whn*_*whn 7

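You are showing a lack of very basic understanding of what vectorize does, let alone CUDA. Before you read this answer, you should read this: https://numba.pydata.org/numba-doc/dev/user/vectorize.html

You seem to be missing basic information. For example, what does vectorization normally mean outside the context of Numba? "Vector" implies that we are running a SIMD operation over some array (aka vector) input. Look at your code: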

@vectorize(['complex128(complex128)'], target='cuda')
def mandelbrot(c):

    z = 0
    n = 0
    while abs(z) <= 2 and n < 80:
        z = z*z + c
        n += 1
    return n

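When you add that decorator, you turn this function into a vectorized version. Without the decorator, it takes a scalar value, i.e. a single complex value. Once converted, mandelbrot expects a vector (an array), so that every value can be processed in parallel. So can you spot the gross misuse of the function you just created here?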

def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):

    for x in range(0, WIDTH):

        for y in range(0, HEIGHT):

            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            # Compute the number of iterations
            m = mandelbrot(c)
            # The color depends on the number of iterations
            color = 255 - int(m * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))

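Your mandelbrot function is operating on scalar values inside a loop. In other words, you are using your vectorized function incorrectly, and in the worst possible way. Look at this converted code: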

import numpy as np


def vari(WIDTH, HEIGHT, RE_START, RE_END, IM_START, IM_END, draw):

    complex_mat = np.empty((HEIGHT, WIDTH), dtype=np.complex128)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            print(x)
            # Convert pixel coordinate to complex number
            c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                        IM_START + (y / HEIGHT) * (IM_END - IM_START))
            complex_mat[y,x] = c


    # Compute the number of iterations
    m = mandelbrot(complex_mat)
    for x in range(0, WIDTH):
        for y in range(0, HEIGHT):
            # The color depends on the number of iterations
            color = 255 - int(m[y,x] * 255 / 80)
            # Plot the point
            draw.point([x, y], (color, color, color))

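We first create the "vector" to feed into the "vectorized function". In this case any NumPy array should do: the function is applied element-wise and the output has the same shape as the input.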
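As a rough illustration of building the whole input array first, the nested loops that fill complex_mat can also be replaced by NumPy broadcasting. This is only a sketch and not part of the code above: the helper name make_grid is made up here, and it assumes the same x / WIDTH and y / HEIGHT mapping as the loops.

```python
import numpy as np


def make_grid(width, height, re_start, re_end, im_start, im_end):
    """Return a (height, width) complex128 array of pixel coordinates.

    Hypothetical helper: it mirrors the formula used in the nested loops,
    with one row per y and one column per x.
    """
    re = re_start + (np.arange(width) / width) * (re_end - re_start)
    im = im_start + (np.arange(height) / height) * (im_end - im_start)
    # A (1, width) row plus a (height, 1) column broadcasts to the full grid.
    return re[np.newaxis, :] + 1j * im[:, np.newaxis]


complex_mat = make_grid(6, 4, -2, 1, -1, 1)
print(complex_mat.shape)  # (4, 6)
```

An array built this way can be passed to mandelbrot(complex_mat) in a single call, exactly like the loop-filled version.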

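Now, you will still find that this code is slow. Again, there is another very fundamental lack of understanding here, which points to a lack of prior research. I suggest you benchmark this code, and that you do so before asking SO for advice on how to improve speed. You would probably find that the "mandelbrot" code is not even what directly causes the slowdown. Everything else you are doing is still serialized. You need to move the generation of the complex numbers, i.e. the Mandelbrot point generation, onto the GPU. I am not sure how to do that with Numba, and it is well beyond the scope of your question, but this might be useful: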

https://github.com/numba/numba/issues/4309
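The benchmarking advice above can be tried with a small timing sketch like the one below. It is an assumption-laden stand-in rather than a measurement of the GPU code: it uses a plain-Python escape-time function and a tiny image, and the names mandelbrot_scalar and time_stages are invented for this example. The point is only to time each stage separately.

```python
import time


def mandelbrot_scalar(c, max_iter=80):
    """Plain-Python escape-time count, used only as a CPU stand-in."""
    z = 0j
    n = 0
    while abs(z) <= 2 and n < max_iter:
        z = z * z + c
        n += 1
    return n


def time_stages(width, height):
    """Time point generation, iteration and colour mapping separately."""
    timings = {}

    start = time.perf_counter()
    points = [complex(-2 + (x / width) * 3, -1 + (y / height) * 2)
              for y in range(height) for x in range(width)]
    timings["generate points"] = time.perf_counter() - start

    start = time.perf_counter()
    counts = [mandelbrot_scalar(c) for c in points]
    timings["iterate"] = time.perf_counter() - start

    start = time.perf_counter()
    colors = [255 - int(m * 255 / 80) for m in counts]
    timings["map to colors"] = time.perf_counter() - start

    return timings, colors


timings, _ = time_stages(64, 36)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f} s")
```

When the same pattern is applied to Numba-compiled code, keep in mind that the first call of a jitted function includes compilation time, so it is usually worth timing a second call as well.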

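It looks like you will want to use the built-in CUDA parallelization tools rather than vectorize, so that you do not have to pass useless data to the GPU (i.e., you can just iterate over the pixels you need to generate values for, rather than passing the pixel indices to CUDA).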

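Besides passing a lot of data back and forth between the CPU and the GPU, another reason the code runs slowly is the use of complex128. GPUs sometimes do not have "fast" double precision; Nvidia in particular tends to cripple double-precision performance on consumer GPUs, to the point where double precision can run at 1/32 the speed of single precision. This is relevant because a complex128 is really two doubles glued together. complex64 would likely give better speed. You are unlikely to run into problems with the lower precision in this experiment, although you may hit precision errors once you zoom into the Mandelbrot set. There are techniques that work around this by seamlessly "wrapping" the function that computes the Mandelbrot set, to prevent those artifacts.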
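To put rough numbers on the data-transfer and precision points above, the sketch below computes the size of one full-resolution array for a few dtypes and compares the pixel spacing of the question's plot window with float32 resolution. The helper name array_gib is made up for this example; the image size and window are the ones from the question.

```python
import numpy as np

WIDTH, HEIGHT = 15360, 8640  # image size from the question
pixels = WIDTH * HEIGHT


def array_gib(dtype_name):
    """Size in GiB of one WIDTH x HEIGHT array of the given dtype."""
    return pixels * np.dtype(dtype_name).itemsize / 2**30


for name in ("complex128", "complex64", "int32"):
    print(f"{name}: {np.dtype(name).itemsize} bytes/element, "
          f"{array_gib(name):.2f} GiB per full-size array")

# Distance between neighbouring pixels on the real axis (window is -2..1).
pixel_step = 3 / WIDTH
# Gap between adjacent float32 values near 2.0, the largest magnitude in play.
float32_ulp = float(np.spacing(np.float32(2.0)))
print(f"pixel step {pixel_step:.2e} vs float32 spacing {float32_ulp:.2e}")
```

A complex128 input array at this resolution is close to 2 GiB, twice the size of a complex64 one, and at this zoom level the pixel step is still hundreds of times larger than the float32 spacing, which is consistent with the expectation that single precision is adequate here.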

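Finally, when I ran the modified code, it ran fine. In other words, I did not get the

  File "D:\anaconda\lib\site-packages\numba\cuda\cudadrv\nvvm.py", line 345, in get_arch_option
    return 'compute_%d%d' % arch
TypeError: not enough arguments for format string

error. If you still get this error when running my modified version, then you have some other configuration problem which, given the lack of research, is too broad and beyond the scope of this question. For example, it could be as simple as "did you install CUDA", but without a more focused question we have no way of knowing. Here is the output I generated (smaller, so that it fits SO's size requirements). Note that I did not replace

@vectorize(['complex128(complex128)'], target='cuda')

with

@vectorize(['int32(complex128)'], target='cuda')

That is not the proper solution to your problem. This again points to some user-specific configuration error.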

[image: the generated Mandelbrot output]


Tec*_*Guy 6

The problem was solved by replacing

@vectorize(['complex128(complex128)'], target='cuda')

with

@vectorize(['int32(complex128)'], target='cuda')

This does not mean the performance is better: it is worse. I think this is because the program is not parallelizable. The only thing that made the performance better was using

@njit(nogil=True)

The real problem was that I did not have cudatoolkit installed. I am using Anaconda. It was a simple fix:

conda install cudatoolkit