Can someone help me convert a nested for loop into a CUDA kernel? This is the function I am trying to convert:
// Convolution on Host
void conv(int* A, int* B, int* out) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}
I have tried very hard to parallelize this code.
Here is my attempt:
__global__ void conv_Kernel(int* A, int* B, int* out) {
    int i = blockIdx.x;
    int j = threadIdx.x;
    __shared__ int temp[N];
    __syncthreads();
    temp[i + j] = A[i] * B[j];
    __syncthreads();
    int sum = 0;
    for (int k = 0; k …
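
For comparison, here is one possible way to restructure the host loop for the GPU. It is a sketch under stated assumptions, not a fix of the attempt above. In the host loop, many (i, j) pairs write to the same out[i + j], so mapping one thread to each pair would need atomics or a reduction. Also, a __shared__ array is private to each block, and the index i + j can reach 2N - 2, which is past the end of temp[N]. The sketch below instead gives each thread one output index k and lets it sum A[i] * B[k - i] over the valid i, so no two threads write to the same location. The names conv_gather and conv_gpu, the runtime length parameter n (in place of the compile-time N), and the block size of 256 are assumptions for illustration. The kernel overwrites out rather than accumulating into it, which matches the host version only if out starts at zero.

```cuda
#include <cuda_runtime.h>

// One thread per output element k: out[k] = sum over i of A[i] * B[k - i].
// Each thread owns its own out[k], so no atomics or shared memory are needed.
__global__ void conv_gather(const int* A, const int* B, int* out, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= 2 * n - 1) return;            // the output has 2n - 1 elements

    int lo = (k < n) ? 0 : k - n + 1;      // smallest i such that k - i <= n - 1
    int hi = (k < n) ? k : n - 1;          // largest i such that k - i >= 0
    int sum = 0;
    for (int i = lo; i <= hi; ++i)
        sum += A[i] * B[k - i];
    out[k] = sum;                          // overwrites; assumes out started at zero
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and copies the 2n - 1 results back. Returns false if a CUDA call fails.
bool conv_gpu(const int* A, const int* B, int* out, int n) {
    const int m = 2 * n - 1;
    int *dA = nullptr, *dB = nullptr, *dOut = nullptr;

    bool ok = cudaMalloc(&dA, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dB, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&dOut, m * sizeof(int)) == cudaSuccess
           && cudaMemcpy(dA, A, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, B, n * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;                         // assumed block size
        const int blocks = (m + threads - 1) / threads;  // round up to cover all k
        conv_gather<<<blocks, threads>>>(dA, dB, dOut, n);
        // The blocking device-to-host copy also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, m * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);   // cudaFree on a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dOut);
    return ok;
}
```

Calling conv_gpu(A, B, out, N) with host arrays A and B of length N and out of length 2N - 1 should produce the same result as conv when out is initially zero. The trade-off is that threads near the middle of the output do more work than threads near the ends. For large N, a tiled version that stages slices of A and B in shared memory may be faster, but that is beyond this sketch.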