CUDA illegal memory access encountered

Val*_*lSe 3 c++ cuda

I wrote a CUDA program on my laptop, which has an Nvidia GTX 960M. The code runs there without any problems. I also implemented the error checking that can be found in this thread: What is the canonical way to check for errors using the CUDA Runtime API?

and tested the code with cuda-memcheck, which reported 0 errors.
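For reference, the error-checking pattern from that thread is essentially the following macro (roughly what I wrap around every CUDA runtime call; the "GPUassert" line in the output further below comes from it):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call, e.g. gpuErrchk(cudaPeekAtLastError());
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}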

I wanted to test my code on a server with an Nvidia Titan X. There, however, cudaPeekAtLastError() throws the error:

illegal memory access was encountered

On both my laptop and the server I use the following heap allocation:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);

and run with the following blocks and threads:

int blockSize = 128;
int numBlocks = (nPossibilities + blockSize - 1) / blockSize;

The GTX 960M has compute capability 5.0 and the Titan X has compute capability 6.1, but according to the compute capability table (Wikipedia), both support a maximum of 32 active blocks and a maximum of 2048 threads per multiprocessor.

I ran cuda-memcheck on the server, and the illegal memory access turned out to be caused by a null pointer.

To solve the problem, I increased the heap memory allocation from 1 GB to 2 GB using the following lines, and the problem went away:

const size_t malloc_limit = size_t(2048) * size_t(2048) * size_t(2048);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, malloc_limit); 

My question is: why does this problem occur on the Titan X but not on the 960M? Why do I need to increase the heap memory size for the Titan X but not for the 960M?

I can post my code if needed, but it is large, with several function calls inside the kernel.

The cuda-memcheck output is as follows:

GPUassert: unspecified launch failure all.cu 779
========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
=========     by thread (125,0,0) in block (193,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
=========     Host Frame:./all [0x1dac1]
=========     Host Frame:./all [0x382d3]
=========     Host Frame:./all [0x9508]
=========     Host Frame:./all [0x93c0]
=========     Host Frame:./all [0x942d]
=========     Host Frame:./all [0x8d7a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Invalid __global__ write of size 8
=========     at 0x00001130 in /home/osa/cuda/all.cu:186:fun(double*, double*, double*, double*, double*, double*, int, int, int)
=========     by thread (124,0,0) in block (193,0,0)
=========     Address 0x00000000 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2fe) [0x282a4e]
=========     Host Frame:./all [0x1dac1]
=========     Host Frame:./all [0x382d3]
=========     Host Frame:./all [0x9508]
=========     Host Frame:./all [0x93c0]
=========     Host Frame:./all [0x942d]
=========     Host Frame:./all [0x8d7a]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
=========     Host Frame:./all [0x3c2c6]
=========     Host Frame:./all [0x8d83]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaPeekAtLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x391b13]
=========     Host Frame:./all [0x39b93]
=========     Host Frame:./all [0x8d88]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20840]
=========     Host Frame:./all [0x2999]
=========
========= ERROR SUMMARY: 4 errors

In my code, the total number of possibilities is computed for combinations of up to 19 digits. This number determines the total number of threads. The number of possibilities is computed as (2^n)-1, so if I choose 9 digits it is 511, and the kernel will run a total of 511 threads.
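In code this corresponds to something like the following (a small sketch; nPossibilities is the same quantity used in the launch configuration above):

int nDigits = 9;                            // up to 19 in my runs
int nPossibilities = (1 << nDigits) - 1;    // (2^n) - 1, e.g. 511 for 9 digits
// nPossibilities is also the total number of threads the kernel has to cover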

For the kernel configuration I chose a blockSize of 128, but I also pass the number of possibilities (nPossibilities) as a parameter, and inside the kernel I do the following:

if (idx > 0 && idx < nPossibilities)
{
 //Do something
}

On the server the code works for up to 15 digits, which corresponds to 32,767 threads. 16 digits and above causes the error posted above; for 16 digits it is 65,535. Does this mean that on the Titan Xp, up to ~32,000 threads in flight need 1 GB of heap, and above that I need to allocate more? But for 19 digits I need a total of 524,287 threads! That is a lot! So why is 1 GB enough for ~32,000 threads, while 2 GB is enough for ~524,000 threads?

The size of the variables I allocate inside the kernel with new also depends on the number of digits. I roughly calculated the size of the allocated variables: about 0.032 MB for 15 digits, 0.034 MB for 16 digits, and 0.0415 MB for 19 digits.
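Schematically, the kernel therefore has roughly the following structure (a heavily simplified sketch; the parameter list and the scratch size are placeholders, not my actual code):

__global__ void fun(double *out, int nDigits, int nPossibilities)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx > 0 && idx < nPossibilities)
    {
        // per-thread scratch space taken from the device heap; the element
        // count here is a placeholder -- in my code it grows with nDigits
        int scratchElems = nDigits * 256;
        double *scratch = new double[scratchElems];

        // ... several device function calls that use scratch ...
        out[idx] = 0.0;

        delete[] scratch;
    }
}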

Rob*_*lla 5

Because the Titan Xp supports more threads "in flight" than a 960M.

Presumably in your CUDA device code, you are doing something like malloc or new (and hopefully also free or delete). These allocate out of the device heap, the size of which is controlled by the CUDA runtime API call you are using for this: cudaDeviceSetLimit(cudaLimitMallocHeapSize, 1024 * 1024 * 1024);.
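To make the failure mode concrete: when the device heap is exhausted, in-kernel malloc (and new) returns NULL rather than failing loudly, and writing through that pointer produces exactly the kind of invalid global write at address 0x00000000 shown in the cuda-memcheck report above. A minimal sketch of the pattern (a hypothetical kernel, not your code):

#include <cstdio>

__global__ void allocPerThread(int elemsPerThread)
{
    // each thread takes its own buffer from the device heap (cudaLimitMallocHeapSize)
    double *scratch = (double *)malloc(elemsPerThread * sizeof(double));

    // if the heap is exhausted, malloc returns NULL; writing through it would be
    // an invalid __global__ write at address 0x00000000
    if (scratch == NULL)
    {
        printf("thread %d: device heap exhausted\n",
               blockIdx.x * blockDim.x + threadIdx.x);
        return;
    }

    scratch[0] = 1.0;   // safe only because of the NULL check above
    free(scratch);
}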

The maximum number of threads that can be running at a given time on either of these GPUs is given by 2048*number of SMs. Even if the occupancy number for your particular code is less than 2048 per SM, the number (max occupancy threads per SM) is probably the same whether on 960M or Titan Xp.

The total number of threads in flight is therefore determined by the number of SMs. The 960M has 5 SMs, so it can have at most 2048 x 5 = ~10,000 threads in flight (i.e., at some stage of execution). The Titan Xp has 30 SMs, so it can have 2048 x 30 = ~60,000 threads in flight. This means that if each thread does a malloc of a particular size and later does a free, there can be up to 10,000 outstanding allocations at any point in time on the 960M, but up to 60,000 outstanding allocations at any point in time on the Titan Xp. More outstanding allocations = more demand for (device heap) memory.
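A rough back-of-envelope using the per-thread allocation sizes you report (estimates only, ignoring allocator overhead) illustrates the difference:

// in-flight heap demand, using the question's rough per-thread sizes
const double mb_960M    =  5 * 2048 * 0.032;    // ~330 MB  -> fits comfortably in a 1 GB heap
const double mb_titanXp = 30 * 2048 * 0.0415;   // ~2550 MB -> far exceeds a 1 GB heap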

Therefore, on the Titan Xp you very likely need more available device heap space than on the 960M.