CUDA thread block size 1024 does not work (cc = 2.0, sm = 21)

vit*_*ums 3 cuda

The configuration I'm running:

  • CUDA Toolkit 5.5
  • NVidia Nsight Eclipse Edition
  • Ubuntu 12.04 x64
  • CUDA device: NVidia GeForce GTX 560, cc = 2.0, sm = 21 (so, as you can see, I can use blocks of up to 1024 threads)

I render my display on the iGPU (Intel HD Graphics), so I can use the Nsight debugger.

But when I set threads > 960, I get some strange behavior.

Code:

#include <stdio.h>
#include <stdlib.h>   // for exit() and EXIT_FAILURE
#include <cuda_runtime.h>

// Kernel that does nothing but a single-precision division
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {

    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Here I run my kernel
    mytest<<<1, 961>>>();

    err = cudaGetLastError();

    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Reset the device and exit
    err = cudaDeviceReset();

    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Done\n");
    return 0;
}

And... it doesn't work. The problem is in the last line of code, the floating-point division. Every time I try to divide by a float, my code compiles but fails to launch. The runtime error output is:

error=too many resources requested for launch

And this is what I get in the debugger when I step over the launch:

warning: Cuda API error detected: cudaLaunch returned (0x7)

Build output with -Xptxas -v:

12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all 
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt  -x cu -o  "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used

../src/vectorAdd.cu(7): warning: variable "a" was set but never used

ptxas info    : 4 bytes gmem, 8 bytes cmem[14]
ptxas info    : Function properties for _ZN4dim3C1Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info    : Function properties for _Z6mytestv
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info    : Function properties for _ZN4dim3C2Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu

Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o  "block_size_test"  ./src/vectorAdd.o   
Finished building target: block_size_test


12:57:41 Build Finished (took 1s.659ms)

When I add the -keep flag, the compiler generates a .cubin file, but I can't read it to find out the smem and reg values, following this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources. At least by now this file must have some different format.

So I'm forced to use 256 threads per block, which is probably not a bad idea, considering this .xls: CUDA_Occupancy_calculator.

Anyway, any help would be greatly appreciated.

Mic*_* M. 5

I filled in the CUDA Occupancy Calculator file with your current information:

  • Compute capability: 2.1
  • Threads per block: 961
  • Registers per thread: 34
  • Shared memory: 0

I got 0% occupancy, limited by the register count.
If you set the thread count to 960, you get 63% occupancy, which explains why that works.

Try limiting the register count to 32 and setting the thread count to 1024, for 67% occupancy.

To limit the number of registers, use the following option: nvcc [...] --maxrregcount=32