
CUDA - why is warp-based parallel reduction slower?

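I had an idea for a warp-based parallel reduction, since all threads of a warp are synchronized by definition.

The idea was that the input data can be reduced by a factor of 64 (each thread reduces two elements) without needing any synchronization.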
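To make the factor-of-64 idea concrete, here is a minimal sketch of a single warp summing 64 floats in shared memory. The names `warpReduce64`, `sum64Kernel` and `sum64OnDevice` are hypothetical and do not appear in the implementations being compared. One assumption differs from the premise above: the sketch calls `__syncwarp()` between steps, because implicit lockstep execution within a warp is not guaranteed on Volta and newer GPUs. No block-wide `__syncthreads()` is used.

```cuda
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

// Hypothetical helper: one warp sums the 64 floats in s[0..63].
// lane is the thread's index within the warp (0..31).
__device__ float warpReduce64(float* s, int lane)
{
    // Each lane adds two elements: 64 values become 32.
    s[lane] += s[lane + WARP_SIZE];
    __syncwarp();

    // Tree reduction inside the warp: 32 values become 1.
    for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
        if (lane < offset) {
            s[lane] += s[lane + offset];
        }
        __syncwarp();  // warp-level barrier only (assumption, see lead-in)
    }
    return s[0];
}

// Launch with exactly one warp: <<<1, WARP_SIZE>>>.
__global__ void sum64Kernel(const float* in, float* out)
{
    __shared__ float s[2 * WARP_SIZE];
    int lane = threadIdx.x;

    // Each lane loads two input elements into shared memory.
    s[lane] = in[lane];
    s[lane + WARP_SIZE] = in[lane + WARP_SIZE];
    __syncwarp();

    float total = warpReduce64(s, lane);
    if (lane == 0) {
        *out = total;
    }
}

// Hypothetical host wrapper: copies 64 floats to the device and returns their sum.
// Error checking is omitted to keep the sketch short.
float sum64OnDevice(const float* hostIn)
{
    float* dIn = nullptr;
    float* dOut = nullptr;
    float result = 0.0f;

    cudaMalloc(&dIn, 2 * WARP_SIZE * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hostIn, 2 * WARP_SIZE * sizeof(float), cudaMemcpyHostToDevice);

    sum64Kernel<<<1, WARP_SIZE>>>(dIn, dOut);

    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Passing an array of 64 ones to `sum64OnDevice` should return 64, which is the same kind of "counting" check as filling the input with ones.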

As in Mark Harris's original implementation, the reduction is applied at block level and the data resides in shared memory. http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

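I created a kernel to test his version and my warp-based version.
The kernel itself is identical in both cases: it stores BLOCK_SIZE elements in shared memory and writes its result to an output array at its unique block index.

The algorithm itself works fine. I tested it with an array filled with ones to check the "counting".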

Function body of the implementations:

/**
 * Performs a parallel reduction with operator add
 * on the given array and writes the result with thread 0
 * to the given target variable.
 *
 * @param inValues float* Input float array, length must be a multiple of 2 and equal to blockDim.x
 * @param outTargetVar float& Variable that receives the result
 */
__device__ void reductionAddBlockThread_f(float* inValues,
    float &outTargetVar)
{
    // code of the below functions
}


1. Implementation of his version: