Can someone help me convert a nested for loop into a CUDA kernel? This is the function I am trying to convert:
// Convolution on Host
void conv(int* A, int* B, int* out) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}
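For reference, here is a host-side sketch of the data dependency that makes this loop awkward to parallelize: several (i, j) pairs add into the same out[i + j], so giving each pair its own thread would race unless the update is atomic. The names conv_scatter, conv_gather, and OUT_LEN, as well as N = 4, are assumptions for illustration; the original leaves N undefined. Note also that conv_scatter zeroes out first, whereas the original accumulates into whatever out already holds.

```c
#include <string.h>  /* memset */

#define N 4                  /* assumed size; not given in the original */
#define OUT_LEN (2 * N - 1)  /* i + j ranges over 0 .. 2N-2 */

/* Scatter form, same loop as conv(): many (i, j) pairs write to one out[i + j]. */
void conv_scatter(const int *A, const int *B, int *out) {
    memset(out, 0, OUT_LEN * sizeof(int));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}

/* Gather form: each out[k] is computed independently of every other out[k]. */
void conv_gather(const int *A, const int *B, int *out) {
    for (int k = 0; k < OUT_LEN; ++k) {
        int lo = k < N ? 0 : k - N + 1;  /* smallest i such that j = k - i < N  */
        int hi = k < N ? k : N - 1;      /* largest i such that j = k - i >= 0 */
        int sum = 0;
        for (int i = lo; i <= hi; ++i)
            sum += A[i] * B[k - i];
        out[k] = sum;
    }
}
```

In the gather form the outer loop over k has no shared writes, so it is the natural candidate for a one-thread-per-output mapping; whether that is the best layout for a given N is not something this sketch settles.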
I have tried very hard to parallelize this code.
Here is my attempt:
__global__ void conv_Kernel(int* A, int* B, int* out) {
    int i = blockIdx.x;
    int j = threadIdx.x;
    __shared__ int temp[N];
    __syncthreads();
    temp[i + j] = A[i] * B[j];
    __syncthreads();
    int sum = 0;
    for (int k = 0; k …