I am trying to loop over a 2D array efficiently in CUDA. In my host code I have:
double **h_matrix; // Matrix on host of size Nx by Ny
double tmp;
...
for(i = 0; i < Nx; i++) {
    for(j = 0; j < Ny; j++) {
        tmp = h_matrix[i][j];
        ... // Perform some operation on tmp
        h_matrix[i][j] = tmp;
    }
}
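To perform a similar task efficiently in CUDA, I understand that I have to use cudaMallocPitch() to allocate memory for the 2D array, as shown in the CUDA Programming Guide (scroll down a bit for the example). That example does not help much, because its kernel does not use any information about the grid, block, or thread executing it, even though it is launched as <<<100, 512>>>.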
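NVIDIA's Parallel Forall blog suggests using grid-stride loops to write flexible and scalable kernels; however, their examples use only 1D arrays. How can I write grid-stride loops for a 2D array allocated with cudaMallocPitch() in order to parallelize the code shown above? Should I use a 2D dimGrid and dimBlock, and if so, how?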
Here is a complete compilable example that I created based on JackOLantern's answer.
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define N 11  // number of rows
#define M 3   // number of columns

__global__ void kernel(float * d_matrix, size_t pitch) {
    // Outer grid-stride loop over rows
    for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < N; j += blockDim.y * gridDim.y) {
        // pitch is in bytes, so do the row offset on a char pointer
        float* row_d_matrix = (float*)((char*)d_matrix + j*pitch);
        // Inner grid-stride loop over columns
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < M; i += blockDim.x * gridDim.x) {
            row_d_matrix[i] = (j * M + i) + (j * M + i);
        }
    }
}

// Compare the host reference with the data copied back from the device
void verify(float *h, float *d, int size) {
    for (int i = 0; i < size; i++) {
        assert(h[i] == d[i]);
    }
    printf("Results match\n");
}

int main() {
    float *h_matrix;   // reference result computed on the host
    float *d_matrix;   // pitched device allocation
    float *dc_matrix;  // host copy of the device result

    h_matrix = (float *) malloc(M * N * sizeof(float));
    dc_matrix = (float *) malloc(M * N * sizeof(float));

    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            h_matrix[j * M + i] = (j * M + i) + (j * M + i);
        }
    }

    size_t pitch;
    cudaMallocPitch(&d_matrix, &pitch, M * sizeof(float), N);

    dim3 grid(1, 1, 1);
    dim3 block(3, 3, 1);

    kernel<<<grid, block>>>(d_matrix, pitch);

    // Copy N rows of M floats from the pitched device buffer into a dense host array
    cudaMemcpy2D(dc_matrix, M * sizeof(float), d_matrix, pitch, M * sizeof(float), N, cudaMemcpyDeviceToHost);

    verify(h_matrix, dc_matrix, M * N);

    free(h_matrix);
    cudaFree(d_matrix);
    free(dc_matrix);
}
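This is an old question, but I have some suggestions for improving the accepted answer. I have left out the pitch part, since that has already been covered.
The accepted answer does iterate over all the values, but it does not take any kind of load balancing between threads into account.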
Let's take a small example. Say we launch a kernel with <<<1, block>>>, where block is dim3 block(2,2), and we work on a 5x5 matrix. With the approach from the accepted answer, the work is distributed so that the thread with id (0,0) gets 9 of the iterations, threads (0,1) and (1,0) get 6 each, and thread (1,1) gets only 4.
To balance the load better, my suggestion is to flatten the loop and calculate the 2D indices from the flattened index, something along the lines of:
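int n=11;  // number of rows
int m=3;   // number of columns
int i, j, k;
// Start at the flattened global thread index; stride by the total number of threads
for(i = (blockIdx.y * blockDim.y + threadIdx.y) * blockDim.x * gridDim.x +
        (blockIdx.x * blockDim.x + threadIdx.x);
    i < m*n;
    i += blockDim.y * gridDim.y * blockDim.x * gridDim.x) {
    j = i/m;  // row index
    k = i%m;  // column index
    //Calculations here
}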
The pitch handling then works as in the example above: the row pointer is computed from the value of j.
An extension of the grid-stride loop concept to the 2D case, for a 2D matrix allocated with cudaMallocPitch, could look like:
#define N 11  // number of rows
#define M 3   // number of columns

__global__ void kernel(float * d_matrix, size_t pitch) {
    // Outer grid-stride loop over rows
    for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < N; j += blockDim.y * gridDim.y)
    {
        // Use the loop variable j for the row offset so that rows reached
        // by striding are addressed correctly; pitch is in bytes
        float* row_d_matrix = (float*)((char*)d_matrix + j*pitch);
        // Inner grid-stride loop over columns
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < M; i += blockDim.x * gridDim.x) {
            row_d_matrix[i] = ....
        }
    }
}

int main()
{
    float *d_matrix;
    size_t pitch;
    cudaMallocPitch(&d_matrix,&pitch,M*sizeof(float),N);
    kernel<<<GridSize,BlockSize>>>(d_matrix,pitch);
    // Other stuff
}