Unspecified launch failure on Memcpy

Pti*_*lty 5 cuda

I get an "unspecified launch failure" when I run my program in CUDA. I have checked the errors.

The program is a solver for a differential equation. It iterates TOTAL_ITER times. ROOM_X and ROOM_Y are the width and height of the matrix.

Here is the header; its name is "solver.h":

#define ITER_BETWEEN_SAVES 10000
#define TOTAL_ITER 10000
#define ROOM_X 2048
#define ROOM_Y 2048
#define SOURCE_DIM_X 200
#define SOURCE_DIM_Y 1000
#define ALPHA 1.11e-4
#define DELTA_T 10
#define H 0.1
#include <stdio.h>

void Matrix(float* M);
void SolverCPU(float* M1, float* M2);
__global__ void SolverGPU(float* M1, float* M2);

Here are the kernel and the function that fills the matrix:

#include "solver.h"
#include<cuda.h>

void Matrix(float* M)
{
  for (int j = 0; j < SOURCE_DIM_Y; ++j) {
    for (int i = 0; i <  SOURCE_DIM_X; ++i) {
      M[(i+(ROOM_X/2 - SOURCE_DIM_X/2)) + ROOM_X * (j+(ROOM_Y/2 - SOURCE_DIM_Y/2))] = 100;
    }
  }
}

__global__ void SolverGPU(float* M1, float* M2) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    float M1_Index = M1[i + ROOM_X * j];
    float M1_IndexUp = M1[i+1 + ROOM_X * j];
    float M1_IndexDown = M1[i-1 + ROOM_X * j];
    float M1_IndexLeft = M1[i + ROOM_X * (j+1)];
    float M1_IndexRight = M1[i + ROOM_X *(j-1)];

    M2[i + ROOM_X * j] = M1_Index + (ALPHA * DELTA_T / (H*H)) * (M1_IndexUp + M1_IndexDown + M1_IndexLeft + M1_IndexRight - 4*M1_Index);
}

Here is main:

int main(int argc, char* argv[] ){

    float *M1_h, *M1_d,*M2_h, *M2_d;
    int size = ROOM_X * ROOM_Y * sizeof(float);
    cudaError_t err = cudaSuccess;  

    //Allocating Memories on Host
    M1_h = (float *)malloc(size);
    M2_h = (float *)malloc(size);

    //Allocating Memories on Device
    err=cudaMalloc((void**)&M1_d, size);
    if (err != cudaSuccess) { 
        fprintf(stderr, "Failed to allocate array_d ... %s .\n", cudaGetErrorString(err)); 
        exit(EXIT_FAILURE); 
    }

    err=cudaMalloc((void**)&M2_d, size);    
    if (err != cudaSuccess) { 
        fprintf(stderr, "Failed to allocate array_d ... %s .\n", cudaGetErrorString(err)); 
        exit(EXIT_FAILURE); 
    }

    //Filling the Matrix
    Matrix(M1_h);


    //Copy on Device

    err = cudaMemcpy(M1_d, M1_h, size, cudaMemcpyHostToDevice);
    if(err !=0){
        printf("%s-%d\n",cudaGetErrorString(err),1);
        getchar();  
    }

    err=cudaMemcpy(M2_d, M2_h, size, cudaMemcpyHostToDevice);
    if(err !=0){
        printf("%s-%d",cudaGetErrorString(err),2);
        getchar();  
    }

    dim3 dimGrid(64,64);
    dim3 dimBlock(32,32);


    //SolverGPU<< <threadsPerBlock, numBlocks >> >(M1_d,M2_d);
    for(int i=0;i<TOTAL_ITER;i++) {
        if (i%2==0)
            SolverGPU<< <dimGrid,dimBlock >> >(M1_d,M2_d);
        else
            SolverGPU<< <dimGrid,dimBlock >> >(M2_d,M1_d);
    }

    err=cudaMemcpy(M1_h, M1_d, size, cudaMemcpyDeviceToHost);
    if(err !=0){
        printf("%s-%d",cudaGetErrorString(err),3);
        getchar();  
    }   

    cudaFree(M1_d);
    cudaFree(M2_d);

    free(M1_h);
    free(M2_h);
    return 0;   

}

It compiles without any problems.

When I check for errors, the "unspecified launch failure" shows up on the memcpy after the kernel.

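OK, so I have read that this usually means the kernel did not run properly. But I cannot find the error in the kernel... I guess the mistake is quite simple, but I cannot find it.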

Rob*_*lla 47

When I compile and run your code, I get:

an illegal memory access was encountered-3

printed out.

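It's possible that in your case you are getting "unspecified launch failure" instead. The exact error reported depends on the CUDA version, the GPU, and the platform. But we can proceed either way.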

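Either message indicates that the kernel launched but hit an error and so failed to complete successfully. You can debug kernel execution problems with a debugger, such as cuda-gdb on Linux or Nsight VSE on Windows. But we don't need to pull out the debugger just yet.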

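A useful tool is cuda-memcheck. If we run your program under cuda-memcheck, we get some additional output indicating that the kernel is performing an invalid global read of size 4. This means you are making an out-of-bounds memory access. We can get more clarity if we recompile the code with the -lineinfo switch and then re-run it under cuda-memcheck. Now we get output that looks like this: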

$ nvcc -arch=sm_20 -lineinfo -o t615 t615.cu
$ cuda-memcheck ./t615 |more
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000070 in /home/bob/misc/t615.cu:34:SolverGPU(float*, float*)
=========     by thread (31,0,0) in block (3,0,0)
=========     Address 0x4024fe1fc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x150a7d]
=========     Host Frame:./t615 [0x11ef8]
=========     Host Frame:./t615 [0x3b143]
=========     Host Frame:./t615 [0x297d]
=========     Host Frame:./t615 (__gxx_personality_v0 + 0x378) [0x26a0]
=========     Host Frame:./t615 (__gxx_personality_v0 + 0x397) [0x26bf]
=========     Host Frame:./t615 [0x2889]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf4) [0x1d994]
=========     Host Frame:./t615 (__gxx_personality_v0 + 0x111) [0x2439]
=========
--More--

(and there is much more error output)

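This means the very first error your kernel encountered was an invalid global read of size 4 (for example, an out-of-bounds access while trying to read an int or float quantity). With the lineinfo information, we can see where this occurred: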

=========     at 0x00000070 in /home/bob/misc/t615.cu:34:SolverGPU(float*, float*)

i.e. at line 34 in the file. This line happens to be this line of your kernel code:

    float M1_IndexRight = M1[i + ROOM_X *(j-1)];

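We could debug further, perhaps using in-kernel printf statements to discover where the problem is. But we already know that we are indexing out of bounds, so let's inspect the indexing: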

  i + ROOM_X *(j-1)

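What does this evaluate to when i = 0 and j = 0 (i.e. for thread (0,0) in your 2D thread array)? It evaluates to -2048 (i.e. -ROOM_X), which is an illegal index. Trying to read from M1[-2048] will create a fault.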
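
To make that arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) that mirrors the five index expressions from the kernel and flags the ones that fall outside the allocation. The names flat_index, in_bounds and check_thread are hypothetical helpers made up for illustration; only ROOM_X, ROOM_Y and the index expressions come from the posted code.

```cpp
#include <cassert>
#include <cstdio>

// Grid dimensions copied from solver.h in the question.
constexpr long ROOM_X = 2048;
constexpr long ROOM_Y = 2048;

// Host-side mirror of the flat index the kernel computes for column i, row j.
long flat_index(long i, long j) { return i + ROOM_X * j; }

// True when a flat index lies inside the ROOM_X * ROOM_Y element allocation.
bool in_bounds(long idx) { return idx >= 0 && idx < ROOM_X * ROOM_Y; }

// Hypothetical helper: evaluates the five stencil indices for one thread,
// prints every index that is out of bounds, and returns how many were bad.
int check_thread(long i, long j) {
    assert(i >= 0 && i < ROOM_X && j >= 0 && j < ROOM_Y);
    const long idx[5] = {
        flat_index(i, j),      // M1_Index
        flat_index(i + 1, j),  // M1_IndexUp
        flat_index(i - 1, j),  // M1_IndexDown
        flat_index(i, j + 1),  // M1_IndexLeft
        flat_index(i, j - 1),  // M1_IndexRight
    };
    int bad = 0;
    for (long v : idx) {
        if (!in_bounds(v)) {
            std::printf("thread (%ld,%ld): index %ld is out of bounds\n", i, j, v);
            ++bad;
        }
    }
    return bad;
}
```

Calling check_thread(0, 0) flags two indices, -1 and -2048, while check_thread(1, 1) flags none. Note that this range check only catches accesses outside the buffer: an index such as i+1 at the right-hand edge of a row stays inside the allocation but silently wraps into the next row, so it needs a separate check on i and j.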

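You have a lot of complicated indexing going on in your kernel, so I'm pretty sure there are other errors as well. You can use a similar method to track them down (perhaps using printf to spit out the computed indexes, or else testing the indexes for validity).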
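
As one way of "testing the indexes for validity", here is a rough CPU reference sketch of a single update step that only applies the stencil to interior cells. Your header declares a SolverCPU function but the post does not show its body, so this is not your implementation: the name solver_step_guarded, the width/height/coeff parameters, and the choice to copy boundary cells unchanged are all assumptions for illustration (coeff stands in for ALPHA * DELTA_T / (H*H)). The same interior test on i and j could be placed at the top of the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one solver step with a bounds guard.
// ASSUMPTION: boundary cells are copied unchanged; the question does not
// say which boundary condition is intended.
void solver_step_guarded(const std::vector<float>& in, std::vector<float>& out,
                         int width, int height, float coeff) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    assert(out.size() == in.size());

    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            const std::size_t c =
                static_cast<std::size_t>(i) + static_cast<std::size_t>(width) * j;

            // Only interior cells have all four neighbours inside the grid.
            const bool interior =
                i > 0 && i < width - 1 && j > 0 && j < height - 1;
            if (!interior) {
                out[c] = in[c];  // assumed boundary handling, see above
                continue;
            }

            // Five-point stencil, same form as the expression in the kernel.
            out[c] = in[c] + coeff * (in[c + 1] + in[c - 1] +
                                      in[c + width] + in[c - width] -
                                      4.0f * in[c]);
        }
    }
}
```

Running this on a small grid (say 3x3 or 4x4) and comparing against the GPU output on the same grid is a cheap way to find any remaining indexing mistakes.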