I have some code that I'd like to turn into a CUDA kernel. Take a look:
for (r = Y; r < Y + H; r += 2)
{
    ch1RowSum = ch2RowSum = ch3RowSum = 0;
    for (c = X; c < X + W; c += 2)
    {
        chan1Value = //some calc'd value
        chan3Value = //some calc'd value
        chan2Value = //some calc'd value
        ch2RowSum += chan2Value;
        ch3RowSum += chan3Value;
        ch1RowSum += chan1Value;
    }
    ch1Mean += ch1RowSum / W;
    ch2Mean += ch2RowSum / W;
    ch3Mean += ch3RowSum / W;
}
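Should this be split into two kernels, one to compute the row sums and another to compute the means? And how should I handle the fact that the loop indices don't start at zero and end at N?
Suppose you have a kernel that computes the three values. Each thread in the launch configuration computes the three values for one (r, c) pair.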
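__global__ void value_kernel(int Y, int H, int X, int W)
{
    // One block per row, one thread per column. Block and thread indices
    // start at zero, so the offsets Y and X shift them into the region.
    // (The original loops step by 2; use 2 * blockIdx.x and 2 * threadIdx.x
    // if that stride has to be kept.)
    int r = blockIdx.x + Y;
    int c = threadIdx.x + X;
    chan1value = ...
    chan2value = ...
    chan3value = ...
}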
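On the question of indices that don't start at zero: thread and block indices always start at zero, so the usual approach is to add the offsets (and multiply by the stride) inside the kernel and guard against threads that fall outside the region. Here is a minimal sketch of that mapping. `calc_value`, `run_values`, the `out` buffer, and the block size of 128 are all placeholders I made up for illustration; they are not from the original code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the elided per-pixel calculation.
__device__ float calc_value(int r, int c) { return (float)(r + c); }

// One thread per sampled pixel. Indices start at zero; the region offsets
// (Y, X) and the stride of 2 from the original loops are applied here.
__global__ void value_kernel(int Y, int H, int X, int W, float *out, int cols)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x; // sampled column index
    int i = blockIdx.y;                            // sampled row index
    int r = Y + 2 * i;
    int c = X + 2 * j;
    if (r < Y + H && c < X + W)                    // skip threads past the region
        out[i * cols + j] = calc_value(r, c);
}

// Hypothetical host helper: launches the kernel and copies the values back.
std::vector<float> run_values(int Y, int H, int X, int W)
{
    int rows = (H + 1) / 2;  // iterations of the r loop with step 2
    int cols = (W + 1) / 2;  // iterations of the c loop with step 2
    float *d_out = nullptr;
    cudaMalloc(&d_out, rows * cols * sizeof(float));

    dim3 block(128);
    dim3 grid((cols + block.x - 1) / block.x, rows);
    value_kernel<<<grid, block>>>(Y, H, X, W, d_out, cols);

    std::vector<float> h_out(rows * cols);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Rounding the grid up and checking bounds in the kernel means W does not have to be a multiple of the block size, and rows wider than one block still work.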
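I don't believe you can compute the sums in the kernel above (at least not fully in parallel). You won't be able to use += the way the loop does, because many threads would be updating the same variable at once. If only one thread in each block (row) does the summing and averaging, you can put it all in a single kernel, like this...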
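__global__ void both_kernel(int Y, int H, int X, int W,
                            float *ch1Mean, float *ch2Mean, float *ch3Mean)
{
    // Per-thread values go through shared memory so that thread 0 can read
    // them. Launch with 3 * blockDim.x * sizeof(float) bytes of dynamic
    // shared memory. The ch*Mean output arrays (one entry per row) are an
    // assumption; where the means end up was left open.
    extern __shared__ float vals[];
    float *chan1 = vals;
    float *chan2 = vals + blockDim.x;
    float *chan3 = vals + 2 * blockDim.x;

    int r = blockIdx.x + Y;
    int c = threadIdx.x + X;
    chan1[threadIdx.x] = ...
    chan2[threadIdx.x] = ...
    chan3[threadIdx.x] = ...
    __syncthreads(); // every value must be written before thread 0 reads

    if (threadIdx.x == 0)
    {
        float ch1RowSum = 0;
        float ch2RowSum = 0;
        float ch3RowSum = 0;
        for (int i = 0; i < blockDim.x; i++)
        {
            ch1RowSum += chan1[i];
            ch2RowSum += chan2[i];
            ch3RowSum += chan3[i];
        }
        ch1Mean[blockIdx.x] = ch1RowSum / blockDim.x;
        ch2Mean[blockIdx.x] = ch2RowSum / blockDim.x;
        ch3Mean[blockIdx.x] = ch3RowSum / blockDim.x;
    }
}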
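But it's probably better to use the first value kernel and then a second kernel for the sums and means... The kernel below can be parallelized further, and if it is separate you can focus on that when you're ready.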
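// Launch with one block per row and one thread per block. chan1..chan3 are
// assumed to hold the per-pixel values written to global memory by the value
// kernel (W per row); the ch*Mean arrays are assumed per-row outputs.
__global__ void sum_kernel(int Y, int W,
                           const float *chan1, const float *chan2, const float *chan3,
                           float *ch1Mean, float *ch2Mean, float *ch3Mean)
{
    int row = blockIdx.x; // image row is row + Y
    float ch1RowSum = 0;
    float ch2RowSum = 0;
    float ch3RowSum = 0;
    for (int i = 0; i < W; i++)
    {
        ch1RowSum += chan1[row * W + i];
        ch2RowSum += chan2[row * W + i];
        ch3RowSum += chan3[row * W + i];
    }
    ch1Mean[row] = ch1RowSum / W;
    ch2Mean[row] = ch2RowSum / W;
    ch3Mean[row] = ch3RowSum / W;
}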
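As for parallelizing the sum kernel further, one common option is a tree reduction in shared memory, where every thread in the block helps add up the row instead of a single thread looping over it. The following is only a sketch for one channel. It assumes the values were already written to a row-major buffer with `n` values per row, that the block size is a power of two, and that the mean divides by the number of values actually summed; `row_mean_kernel` and `row_means` are names I invented for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One block per row. Threads cooperate on the row sum using a tree
// reduction in shared memory. Assumes blockDim.x is a power of two.
__global__ void row_mean_kernel(const float *vals, int n, float *rowMean)
{
    extern __shared__ float partial[];
    const float *row = vals + (size_t)blockIdx.x * n;

    // Each thread first accumulates a strided slice of the row.
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        sum += row[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Halve the number of active threads at every step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        rowMean[blockIdx.x] = partial[0] / n;
}

// Hypothetical host helper: computes one mean per row of h_vals.
std::vector<float> row_means(const std::vector<float> &h_vals, int rows, int n)
{
    float *d_vals = nullptr, *d_mean = nullptr;
    cudaMalloc(&d_vals, h_vals.size() * sizeof(float));
    cudaMalloc(&d_mean, rows * sizeof(float));
    cudaMemcpy(d_vals, h_vals.data(), h_vals.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    const int threads = 256; // power of two, as the reduction requires
    row_mean_kernel<<<rows, threads, threads * sizeof(float)>>>(d_vals, n, d_mean);

    std::vector<float> h_mean(rows);
    cudaMemcpy(h_mean.data(), d_mean, rows * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_vals);
    cudaFree(d_mean);
    return h_mean;
}
```

The same kernel can be run once per channel, or extended to carry three partial sums per thread. To get a single mean over the whole region, average the per-row results on the host or in a small follow-up kernel.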