
Accelerated FFT to be invoked from a Python Numba CUDA kernel

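I need to compute the Fourier transform of a 256-element float64 signal. The requirement is that I invoke these FFTs from inside a cuda.jit-compiled section, and each one must complete within 25 µs. Alas, cuda.jit-compiled functions cannot call external libraries, so I wrote my own. Alas, my single-core code is still far too slow (about 250 µs on a Quadro P4000). Is there a better way?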

I created a single-core FFT function that gives correct results but is 10 times too slow. I don't understand how to make good use of multiple cores.

---fft.py

from numba import cuda, boolean, void, int32, float32, float64, complex128
import math, sys, cmath

def _transform_radix2(vector, inverse, out):    
    n = len(vector) 
    levels = int32(math.log(float32(n))/math.log(float32(2)))

    assert 2**levels==n # error: Length is not a power of 2 

    # Uncomment either the numba.cuda or the NumPy memory allocation (intelligent conditional compilation?)
    exptable = cuda.local.array(1024, dtype=complex128)   
    #exptable = np.zeros(1024, np.complex128)

    assert (n // …
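
To make the multi-core question concrete, here is a minimal host-side sketch in plain Python, not the truncated function above. It is an iterative radix-2 FFT written so that each stage is a flat loop over n/2 butterflies that touch disjoint pairs of elements. The names `reverse_bits`, `fft_radix2`, and `naive_dft` are hypothetical and chosen for illustration. The assumption is that, in a cuda.jit kernel, the inner loop over `b` could be spread across 128 threads of one block with a `cuda.syncthreads()` barrier between the 8 stages; that mapping is not shown or timed here.

```python
import cmath


def reverse_bits(i, levels):
    """Reverse the lowest `levels` bits of the integer i."""
    j = 0
    for _ in range(levels):
        j = (j << 1) | (i & 1)
        i >>= 1
    return j


def fft_radix2(signal):
    """Forward FFT of a power-of-2-length sequence (illustrative sketch)."""
    n = len(signal)
    levels = n.bit_length() - 1
    if n == 0 or (1 << levels) != n:
        raise ValueError("length must be a power of 2")

    # Bit-reversal permutation; each output index is independent of the others.
    data = [complex(signal[reverse_bits(i, levels)]) for i in range(n)]

    half = 1  # half the butterfly span of the current stage
    while half < n:
        step = n // (2 * half)  # stride into a length-n twiddle table
        # The n/2 butterflies of one stage read and write disjoint pairs,
        # so this loop is the candidate for one-thread-per-butterfly.
        for b in range(n // 2):
            group, k = divmod(b, half)
            top = group * 2 * half + k
            bot = top + half
            w = cmath.exp(-2j * cmath.pi * (k * step) / n)
            t = data[bot] * w
            data[bot] = data[top] - t
            data[top] = data[top] + t
        # A GPU version would need a block-wide barrier here before the next stage.
        half *= 2
    return data


def naive_dft(signal):
    """O(n^2) reference DFT, used only to check the sketch."""
    n = len(signal)
    return [
        sum(signal[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
```

The twiddle factors are recomputed inline to keep the sketch short; a precomputed table (like `exptable` in the question) held in shared rather than per-thread local memory would be the more likely choice on a GPU, though whether that reaches the 25 µs target is untested.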


python jit cuda fft numba

kay*_*ist

2019-06-26

2
votes

1
answer

1726
views