Simple addition of two ints in CUDA always gives the same result

Los*_*oul 6 cuda

I'm just getting started with CUDA. I've been playing with some hello-world style CUDA code, but it doesn't work and I don't know why.

The code is very simple: it takes two integers, adds them on the GPU, and returns the result, but no matter what I change the numbers to I get the same result (if maths worked that way I would have done much better in the subject than I actually did).

Here's the sample code:

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

extern "C"
void runCudaPart();

// Main cuda function

void runCudaPart() {

    int c;
    int *dev_c;

    cudaMalloc( (void**)&dev_c, sizeof(int) );
    add<<<1,1>>>( 1, 4, dev_c );

    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );

}

The output seems a bit off: 1 + 4 = -1065287167

I've only just set up my environment, so I'm wondering whether there is something wrong with the code or whether it's my environment.

Update: I tried adding some code to print out any errors, but I don't get any error output, only a changed number (is it printing an error code instead of the answer? Even if I do no work in the kernel other than assigning a value to a variable, I still get a similar result).

// CUDA-C includes
#include <cuda.h>
#include <stdio.h>

__global__ void add( int a, int b, int *c ) {
    //*c = a + b;
    *c = 5;
}

extern "C"
void runCudaPart();

// Main cuda function

void runCudaPart() {

    int c;
    int *dev_c;

    cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
    if(err != cudaSuccess){
         printf("The error is %s", cudaGetErrorString(err));
    }
    add<<<1,1>>>( 1, 4, dev_c );

    cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    if(err2 != cudaSuccess){
         printf("The error is %s", cudaGetErrorString(err));
    }


    printf( "1 + 4 = %d\n", c );
    cudaFree( dev_c );

}

The code seems fine, so maybe it's related to my setup. Installing CUDA on OS X Lion was a nightmare, but I think it worked, because the samples in the SDK seem fine. The steps I've taken so far were to go to the Nvidia site and download the latest Mac versions of the driver, toolkit and SDK. I then added export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH and export PATH=/usr/local/cuda/bin:$PATH. I ran deviceQuery and it passed, reporting the following information about my system:

[deviceQuery] starting...

/Developer/GPU Computing/C/bin/darwin/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce 320M"
  CUDA Driver Version / Runtime Version          4.2 / 4.2
  CUDA Capability Major/Minor version number:    1.2
  Total amount of global memory:                 253 MBytes (265027584 bytes)
  ( 6) Multiprocessors x (  8) CUDA Cores/MP:    48 CUDA Cores
  GPU Clock rate:                                950 MHz (0.95 GHz)
  Memory Clock rate:                             1064 Mhz
  Memory Bus Width:                              128-bit
  Max Texture Dimension Size (x,y,z)             1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(8192) x 512, 2D=(8192,8192) x 512
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   No
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      No
  Device PCI Bus ID / PCI location ID:           4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.2, NumDevs = 1, Device = GeForce 320M
[deviceQuery] test results...
PASSED

Update: What's really strange is that even if I remove all the work in the kernel, I still get a result for c? I've reinstalled CUDA and run make on the examples, and they all pass.

tal*_*ies 8

There are basically two problems here:

  1. You are not compiling the kernel for the correct architecture (gathered from the comments)
  2. Your code contains incomplete error checking which misses the point where the runtime error occurs, leading to mysterious and inexplicable symptoms.

In the runtime API, most context-level actions are performed "lazily". When you launch a kernel for the first time, the runtime API will invoke code to find a suitable CUBIN image inside the fat binary image the toolchain emitted for the target hardware, and load it into the context. This can also include JIT recompilation of PTX for a backwards-compatible architecture, but not the other way around. So if you compile a kernel for a compute capability 1.2 device and run it on a compute capability 2.0 device, the driver can JIT compile the PTX 1.x code it contains for the newer architecture, but the reverse does not work. So in your example, the runtime API will generate an error because it cannot find a usable binary image in the CUDA fatbinary image embedded in the executable. The error message is quite cryptic, but you will get an error (see this question for a bit more information).
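As an aside (not part of the original answer), a quick way to confirm which architecture you need to build for is to query the device's compute capability from host code and match the nvcc -arch flag to it, e.g. -arch=sm_12 for the GeForce 320M above. A minimal sketch:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // The GeForce 320M above should report 1.2, so the kernel would need to be
    // built with a matching flag, e.g.: nvcc -arch=sm_12 add.cu
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    return 0;
}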

If your code contained error checking like this:

cudaError_t err = cudaMalloc( (void**)&dev_c, sizeof(int) );
if(err != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err));
}

add<<<1,1>>>( 1, 4, dev_c );
if (cudaPeekAtLastError() != cudaSuccess) {
    printf("The error is %s", cudaGetErrorString(cudaGetLastError()));
}

cudaError_t err2 = cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
if(err2 != cudaSuccess){
     printf("The error is %s", cudaGetErrorString(err));
}

the extra error checking after the kernel launch should catch the runtime API error generated by the kernel load/launch failure.

  • It's also best practice for professionals :) Definitely in debug builds, but I would leave the checks in release builds too. If you don't want the checks in release builds, use the preprocessor to replace them with stubs. (3 upvotes)
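For what it's worth, a minimal sketch of that preprocessor approach (the CUDA_CHECK name is just an example, not a standard macro) could look like this:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Example CUDA_CHECK macro: full error checking in debug builds,
// reduced to the bare call when NDEBUG is defined (release builds).
#ifdef NDEBUG
#define CUDA_CHECK(call) (call)
#else
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t _e = (call);                                         \
        if (_e != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(_e), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)
#endif

Usage would then be e.g. CUDA_CHECK(cudaMalloc((void**)&dev_c, sizeof(int))); and CUDA_CHECK(cudaPeekAtLastError()); right after the kernel launch.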