为什么cuFFT性能会因输入重叠而受损？

Question

为什么cuFFT性能会因输入重叠而受损？

我正在尝试使用cuFFT的回调功能来动态执行输入格式转换(例如,计算8位整数输入数据的FFT,而不先对输入缓冲区进行显式转换float).在我的许多应用程序中,我需要计算输入缓冲区上的重叠 FFT,如前面的SO问题所述.通常,相邻的FFT可能重叠FFT长度的1/4到1/8.

cuFFT具有类似FFTW的接口,通过函数的idist参数cufftPlanMany()显式支持.具体来说,如果我想计算大小为32768的FFT,并且在连续输入之间重叠4096个样本,我会设置idist = 32768 - 4096.这不,因为它得到正确的输出感正常工作.

但是,当我以这种方式使用cuFFT时,我看到了奇怪的性能下降.我设计了一个测试,它以两种不同的方式实现这种格式转换和重叠:

明确告诉cuFFT有关输入的重叠性质:idist = nfft - overlap如上所述设置.安装负载回调函数只是没有从转换int8_t到float根据需要提供给所述回叫缓冲指数.
不要告诉cuFFT关于输入的重叠性质; 对它说谎idist = nfft.然后,让回调函数通过计算应为每个FFT输入读取的正确索引来处理重叠.

这个GitHub要点提供了一个测试程序,该程序通过时序和等效测试实现这两种方法.为简洁起见,我没有在这里重现所有内容.该程序计算一批1024个32768点FFT,重叠4096个样本; 输入数据类型是8位整数.当我在我的机器上运行它(使用Geforce GTX 660 GPU,在Ubuntu 16.04上使用CUDA 8.0 RC)时,我得到以下结果:

executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec

Run Code Online (Sandbox Code Playgroud)

方法2明显更快,我不指望.看一下回调函数的实现:

方法1:

template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index, 
    void *, void *)
{
    return (cufftReal)(((const T *) inbuf)[fft_index]);
}

Run Code Online (Sandbox Code Playgroud)

方法2:

template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf, 
    size_t fft_index, void *, void *)
{
    // fft_index is the index of the sample that we need, not taking 
    // the overlap into account. Convert it to the appropriate sample 
    // index, considering the overlap structure. First, grab the FFT 
    // parameters from constant memory.
    int nfft = overlap_params.nfft;
    int overlap = overlap_params.overlap;
    // Calculate which FFT in the batch that we're reading data for. This
    // tells us how much overlap we need to account for. Just use integer 
    // arithmetic here for speed, knowing that this would cause a problem 
    // if we did a batch larger than 2Gsamples long.
    int fft_index_int = fft_index;
    int fft_batch_index = fft_index_int / nfft;
    // For each transform past the first one, we need to slide "overlap" 
    // samples back in the input buffer when fetching the sample.
    fft_index_int -= fft_batch_index * overlap;
    // Cast the input pointer to the appropriate type and convert to a float.
    return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}

Run Code Online (Sandbox Code Playgroud)

方法2具有明显更复杂的回调函数,甚至包括非编译时间值的整数除法!我希望这比方法1慢得多,但我看到相反的情况.对此有一个很好的解释吗？当输入重叠时,cuFFT是否可能以不同的方式构建其处理,从而导致性能下降？

如果可以从回调中删除索引计算(但是需要将重叠指定给cuFFT),我似乎应该能够实现比方法2快得多的性能.

编辑:运行我的测试程序后nvvp,我可以看到cuFFT肯定似乎是以不同的方式构造其计算.很难理解内核符号名称,但内核调用会像这样分解:

方法1:

__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE:3.72毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>:7.71毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>:12.75毫秒(是的,它被调用两次)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_:7.49毫秒

方法2:

spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>:5.15毫秒
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>:12.88毫秒
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_:7.51毫秒

有趣的是,看起来cuFFT调用两个内核来实际使用方法1来计算FFT(当cuFFT知道重叠时),但是使用方法2(它不知道FFT是重叠的),它只使用方法完成工作一.对于在两种情况下使用的内核,它似乎在方法1和2之间使用相同的网格参数.

我不明白为什么它应该在这里使用不同的实现,特别是因为输入步幅istride == 1.在转换输入处获取数据时,它应该只使用不同的基址; 我认为算法的其余部分应该完全相同.

编辑2:我看到一些更奇怪的行为.我意外地意识到如果我没有适当地破坏cuFFT手柄,我会看到测量性能的差异.例如,我修改了测试程序以跳过cuFFT句柄的销毁,然后以不同的顺序执行测试:方法1,方法2,然后方法2和方法1.我得到了以下结果:

executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec

Run Code Online (Sandbox Code Playgroud)

因此,在为测试用例创建计划时,性能似乎会发生变化,具体取决于是否存在其他cuFFT计划!使用分析器,我看到内核启动的结构在两种情况之间没有变化; 内核似乎都执行得更快.我对此效果也没有合理的解释.

Answer 1

Jas*_*n R 1

根据 @llukas 的建议，我向 NVIDIA 提交了有关该问题的错误报告（如果您注册为开发人员，请访问https://partners.nvidia.com/bug/viewbug/1821802 ）。他们承认由于计划重叠导致绩效较差。他们实际上表示这两种情况下使用的内核配置都不是最佳的，他们计划最终改进这一点。没有给出预计到达时间，但它可能不会出现在下一个版本中（8.0 上周刚刚发布）。最后，他们表示，从 CUDA 8.0 开始，没有解决方法可以使 cuFFT 使用跨步输入的更有效方法。

归档时间：	9 年，2 月前
查看次数：	665 次
最近记录：	9 年，1 月前