Ros*_*han 5 performance synchronization cuda sum
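I need to sum 100000 values stored in an array, but subject to a condition.
Is there a way to do this in CUDA that gives fast results?
Could anyone post a short piece of code that does this?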
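I think that, to perform a conditional reduction, you can fold the condition directly into the sum by multiplying each addend by 0 (false) or 1 (true). In other words, suppose the condition is that the addend must be smaller than 10.f. In that case, borrowing the first kernel from M. Harris's Optimizing Parallel Reduction in CUDA, this would look like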
__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem,
    // multiplied by 1 if it satisfies the condition and by 0 otherwise
    // (assumes the grid covers the input exactly, as in the original kernel)
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i]*(g_idata[i]<10.f);
    __syncthreads();
    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
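The kernel above leaves one partial sum per block, so a host-side step is still needed to launch it and combine the partials. A minimal sketch of that step follows, assuming float data and an array length that need not be a multiple of the block size. The names reduce_if_less and conditional_sum and the threshold parameter are hypothetical and not part of Harris's code. The sketch selects the value instead of multiplying, so that a NaN or +inf element that fails the test contributes exactly zero (multiplying it by 0 would give NaN).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical variant of reduce0: sums only elements smaller than `threshold`.
__global__ void reduce_if_less(const float *g_idata, float *g_odata,
                               unsigned int n, float threshold) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads past the end of the array, and elements failing the test, load 0.
    float v = (i < n) ? g_idata[i] : 0.f;
    sdata[tid] = (i < n && v < threshold) ? v : 0.f;
    __syncthreads();
    // Same interleaved-addressing reduction as reduce0
    // (blockDim.x must be a power of two).
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Host wrapper: launches the kernel and adds the per-block partial sums on the CPU.
// Error checking is omitted for brevity.
float conditional_sum(const std::vector<float> &h_in, float threshold) {
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    if (n == 0) return 0.f;
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Third launch parameter: dynamic shared memory, one float per thread.
    reduce_if_less<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n, threshold);
    std::vector<float> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    float sum = 0.f;
    for (float p : h_partial) sum += p;
    return sum;
}
```

For the 100000 values in the question, conditional_sum(h, 10.f) would launch 391 blocks of 256 threads. For much larger inputs, the partial sums could be reduced by a second kernel launch rather than on the host.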
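If you want to perform the conditional reduction with CUDA Thrust, you can do the same thing using thrust::transform_reduce. Alternatively, you can create a new vector d_b, copy into it all the elements of d_a that satisfy the predicate with thrust::copy_if, and then apply thrust::reduce to d_b. I have not checked which solution performs best. Perhaps the second solution performs better on sparse arrays, that is, when only a few elements satisfy the condition. Below is an example implementing both approaches.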
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
#include <thrust/copy.h>

// --- Operator for the first approach: maps a to a if a < 10, to 0 otherwise
struct conditional_operator {
    __host__ __device__ float operator()(const float a) const {
        return a*(a<10.f);
    }
};

// --- Operator for the second approach: predicate selecting elements < 10
struct is_smaller_than_10 {
    __host__ __device__ bool operator()(const float a) const {
        return (a<10.f);
    }
};

int main(void)
{
    int N = 20;

    // --- Host side allocation and vector initialization
    thrust::host_vector<float> h_a(N,1.f);
    h_a[0] = 20.f;
    h_a[1] = 20.f;

    // --- Device side allocation and vector initialization
    thrust::device_vector<float> d_a(h_a);

    // --- First approach
    float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
    printf("Result = %f\n",sum);

    // --- Second approach
    int N_prime = thrust::count_if(d_a.begin(), d_a.end(), is_smaller_than_10());
    thrust::device_vector<float> d_b(N_prime);
    thrust::copy_if(d_a.begin(), d_a.begin() + N, d_b.begin(), is_smaller_than_10());
    sum = thrust::reduce(d_b.begin(), d_b.begin() + N_prime, 0.f);
    printf("Result = %f\n",sum);

    getchar();
    return 0;
}
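The second approach makes two passes over d_a, one for thrust::count_if and one for thrust::copy_if. As a possible variation, thrust::copy_if returns an iterator to the end of the copied range, so d_b can be sized for the worst case and the counting pass dropped. The sketch below assumes that trade-off is acceptable; the names is_smaller_than and sum_smaller_than are hypothetical, and it has not been benchmarked against the two approaches above.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>

// Hypothetical predicate with a run-time threshold instead of a hard-coded 10.f.
struct is_smaller_than {
    float threshold;
    __host__ __device__ bool operator()(const float a) const {
        return a < threshold;
    }
};

// Sums the elements of d_a that are smaller than `threshold`.
float sum_smaller_than(const thrust::device_vector<float> &d_a, float threshold) {
    // Worst case: every element passes, so reserve room for all of them.
    thrust::device_vector<float> d_b(d_a.size());
    // copy_if returns the end of the range actually written into d_b.
    thrust::device_vector<float>::iterator d_b_end =
        thrust::copy_if(d_a.begin(), d_a.end(), d_b.begin(), is_smaller_than{threshold});
    // Reduce only the copied elements.
    return thrust::reduce(d_b.begin(), d_b_end, 0.f);
}
```

This uses more device memory than sizing d_b with N_prime, since d_b always holds as many slots as d_a, in exchange for one fewer pass over the input.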