Reduce multiple blocks of equal length that are arranged in a big vector using CUDA

Question


ala*_*lae 1 cuda gpgpu nvidia reduction thrust

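I am looking for a fast way to reduce multiple blocks of equal length that are arranged as one big vector. I have N subarrays (contiguous elements) arranged in one big array. Each subarray has a fixed size K, so the whole array holds N*K elements.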

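What I currently do is call the kernel N times, each call computing the reduction of one subarray. I iterate over all the subarrays contained in the big vector: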

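    // d_data is an assumed name: a raw device pointer to the big array of N*K floats
    for (int i = 0; i < N; i++) {
        // copy the i-th subarray into its own device vector
        thrust::device_ptr<float> subarray(d_data + i * K);
        thrust::device_vector<float> Vec(subarray, subarray + K);
        float sum = thrust::reduce(Vec.begin(), Vec.end(), 0.0f, thrust::plus<float>());
        printf("sum %f\n", sum);
    }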


With plain CUDA I would do the following (pseudocode):

    for (i = 0; i < N; i++) {
        reduction_kernel(subarray)   // one kernel launch per subarray
    }


Is there another solution that performs the reduction of all the contiguous subarrays at once, using either plain CUDA or Thrust?

Answer 1

Rob*_*lla 5

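What you are asking for is a segmented reduction. This can be done in Thrust using thrust::reduce_by_key. In addition to your data vector of length N*K, we need a "key" vector that defines each segment. The segments do not have to be the same size, as long as the key vector differentiates them, like this: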

data:  1 3 2 3 1 4 2 3 2 1 4 2 ...
keys:  0 0 0 1 1 1 0 0 0 3 3 3 ...
seg:   0 0 0 1 1 1 2 2 2 3 3 3 ...


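Whenever the key sequence changes, the keys delineate a new segment. (Note that in the example above there are two separate segments delineated with the same key, 0. Thrust does not combine such segments; it treats them separately because one or more different key values intervene.) You do not actually have this key data, but since your segments are of equal length, we can generate the necessary key sequence "on the fly", for speed and efficiency, using a combination of Thrust fancy iterators.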
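To make the segment rule concrete, here is a small CPU reference in plain host C++. It is only a sketch of the semantics described above (a new segment starts whenever the key differs from the previous one), not how Thrust implements it on the GPU. The name reduce_by_key_reference is hypothetical and is not part of Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// CPU reference for the segment rule: a new segment starts whenever the key
// differs from the previous key. Equal keys that are not adjacent are NOT merged.
// reduce_by_key_reference is a hypothetical helper name, not a Thrust API.
template <typename Key, typename T>
std::vector<std::pair<Key, T> > reduce_by_key_reference(
    const std::vector<Key>& keys, const std::vector<T>& data) {
  assert(keys.size() == data.size());
  std::vector<std::pair<Key, T> > out;  // one (key, sum) pair per segment
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (out.empty() || out.back().first != keys[i]) {
      out.push_back(std::make_pair(keys[i], data[i]));  // key changed: open a new segment
    } else {
      out.back().second += data[i];  // same key as before: extend the current segment
    }
  }
  return out;
}
```

Applied to the first twelve data and keys values shown above, this yields four segments with (key, sum) pairs (0, 6), (1, 8), (0, 7), (3, 7): the two runs with key 0 stay separate.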

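The fancy iterators will combine to:

1. produce a linear sequence 0 1 2 3 ... (via counting_iterator)
2. divide each member of the linear sequence by K, the segment length (via transform_iterator). I'm using the Thrust placeholders methodology here, so I don't have to write a functor for the transform iterator.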

This will produce the necessary segment key sequence.
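The key computation itself is just an integer division of the element index by K. As a minimal host-side sketch (plain C++, no GPU involved), the same mapping can be used to compute reference sums for checking GPU results. The names segment_key and segment_sums_reference are hypothetical helpers introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Key of element i when every segment has length K: integer division.
// This mirrors what the placeholder expression _1/K evaluates per element.
inline int segment_key(int i, int K) { return i / K; }

// CPU reference: sum each consecutive length-K segment of data.
// segment_sums_reference is a hypothetical helper name, not a Thrust API.
template <typename T>
std::vector<T> segment_sums_reference(const std::vector<T>& data, int K) {
  assert(K > 0 && data.size() % static_cast<std::size_t>(K) == 0);
  std::vector<T> sums(data.size() / static_cast<std::size_t>(K), T(0));
  for (std::size_t i = 0; i < data.size(); ++i) {
    // the key doubles as the index of the output slot for this element
    sums[segment_key(static_cast<int>(i), K)] += data[i];
  }
  return sums;
}
```

For K = 3 the generated keys run 0 0 0 1 1 1 2 2 2 ..., which matches the seg row shown earlier.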

Here is a worked example:

$ cat t1282.cu
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <iterator>  // std::ostream_iterator

const int N = 1000;  // sequences
const int K = 100;   // length of sequence
typedef int mytype;

using namespace thrust::placeholders;

int main(){

  thrust::device_vector<mytype> data(N*K, 1);
  thrust::device_vector<mytype> sums(N);
  // keys are generated on the fly: element i belongs to segment i/K
  thrust::reduce_by_key(thrust::device,
                        thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1/K),
                        thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K), _1/K),
                        data.begin(),
                        thrust::discard_iterator<int>(),  // output keys are not needed
                        sums.begin());
  // just display the first 10 results
  thrust::copy_n(sums.begin(), 10, std::ostream_iterator<mytype>(std::cout, ","));
  std::cout << std::endl;
}

$ nvcc -arch=sm_35 -o t1282 t1282.cu
$ ./t1282
100,100,100,100,100,100,100,100,100,100,
$

