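I am developing an application with CUDA on Ubuntu 16.04, using CUDA toolkit 8.0.
My question is about the number of threads per block that a GTX 960M (compute capability 5.0) can hold. I am trying to use the maximum number of threads per block, so I call cudaGetDeviceProperties() to get that value (1024 threads, as the documentation shows). However, when I launch my kernel with more than 512 threads per block, the API returns error code 0x7 ("warning: Cuda API error detected: cudaLaunch returned (0x7)"), which means "launch out of resources".
A small example that reproduces my problem: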
#include <cstdio>
#include <cstdlib>
#include <random>
#include <curand.h>
#include <curand_kernel.h>
#define min(a,b) ((a) < (b) ? (a) : (b))
__global__ void bootstrap_V1(int nSamples, int sampleFraction, int seed, unsigned int* sampleIDs, unsigned int* inbagCounts){
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int offset = gridDim.x * blockDim.x;
/*Generating a random number in a specific range:
1- Use CURAND to generate a uniform distribution between 0.0 and 1.0
2- Then multiply this by the desired range (largest value - smallest value + 0.999999).
3- Then add the offset (+ smallest value).
4- Then truncate to an integer.
*/
curandState state;
curand_init(seed, tid, 0, &state);
while(tid < nSamples*sampleFraction){
float randf = curand_uniform(&state);
randf *= ((nSamples - 1) - 0) + 0.999999;
randf += 0;
int rand = (int)truncf(randf);
sampleIDs[tid] = rand;
atomicAdd(&(inbagCounts[rand]), 1);
tid += offset;
}
}
int main(void) {
int nSamples = 100;
int sampleFraction = 1;
std::random_device rd;
std::mt19937_64 gen(rd());
std::uniform_int_distribution<size_t>dist;
cudaError_t error;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocks = prop.multiProcessorCount;
int maxThreadsPerBlock = prop.maxThreadsPerBlock;
int seed = dist(gen);
unsigned int *sampleIDs = (unsigned int *)malloc(nSamples * sampleFraction * sizeof(int));
unsigned int *inbagCounts = (unsigned int *)malloc(nSamples * sizeof(int));
unsigned int *dev_sampleIDs, *dev_inbagCounts;
error = cudaMalloc((void **)&dev_sampleIDs, nSamples*sampleFraction*sizeof(int));
error = cudaMalloc((void **)&dev_inbagCounts, nSamples*sizeof(int));
error = cudaMemset(dev_sampleIDs, 0, nSamples*sampleFraction*sizeof(int));
error = cudaMemset(dev_inbagCounts, 0, nSamples*sizeof(int));
if (error != cudaSuccess)
printf("%s\n", cudaGetErrorString(error));
int threadsPerBlock = min(maxThreadsPerBlock, nSamples);
bootstrap_V1<<<blocks,threadsPerBlock>>>(nSamples, sampleFraction, seed, dev_sampleIDs, dev_inbagCounts);
cudaMemcpy(sampleIDs, dev_sampleIDs, nSamples*sampleFraction*sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(inbagCounts, dev_inbagCounts, nSamples*sizeof(int), cudaMemcpyDeviceToHost);
free(sampleIDs);
free(inbagCounts);
cudaFree(dev_inbagCounts);
cudaFree(dev_sampleIDs);
}
These are the compile lines I am using:
/usr/local/cuda-8.0/bin/nvcc -G -g -O0 -std=c++11 -gencode arch=compute_50,code=sm_50 -odir "." -M -o "main.d" "../main.cu"
/usr/local/cuda-8.0/bin/nvcc -G -g -O0 -std=c++11 --compile --relocatable-device-code=false -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50 -x cu -o "main.o" "../main.cu"
/usr/local/cuda-8.0/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50 -link -o "prueba" ./main.o
Can anyone explain why this happens? Thank you very much.
As requested, the verbose ptxas output:
ptxas info : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info : Function properties for cudaDeviceGetAttribute
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z12bootstrap_V1iiiPjS_' for 'sm_50'
ptxas info : Function properties for _Z12bootstrap_V1iiiPjS_
6560 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 128 registers, 6560 bytes cumulative stack size, 352 bytes cmem[0], 8 bytes cmem[2]
ptxas info : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN4dim3C2Ejjj
16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info : Function properties for cudaMalloc
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaGetDevice
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN4dim3C1Ejjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN61_INTERNAL_39_tmpxft_000013a3_00000000_7_main_cpp1_ii_055b743a9atomicAddEPjj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for cudaFuncGetAttributes
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Can anyone explain why this happens?
The proximate cause of what you observe is contained in this output:
ptxas info : Compiling entry function '_Z12bootstrap_V1iiiPjS_' for 'sm_50'
ptxas info : Function properties for _Z12bootstrap_V1iiiPjS_
6560 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 128 registers, 6560 bytes cumulative stack size, 352 bytes cmem[0], 8 bytes cmem[2]
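The output above for the function bootstrap_V1 (i.e. the kernel) shows that the compiler has chosen to use 128 registers per thread.
For a thread block to launch, and therefore for this kernel to work, the product of the registers per thread (128) and the number of threads per block (e.g. 512 or 1024) gives the total number of registers required.
That number must be less than or equal to the registers available in the SM. The maximum available registers is given in table 13 of the CUDA programming guide (and also in the deviceQuery output for your GPU). For a cc 5.0 GPU, the maximum registers per SM is 64K. For your code, 128 registers per thread * 1024 threads yields 128K registers, which cannot work and produces the error you see ("too many resources requested for launch"). If you did proper CUDA error checking in your code, you would get this text-based error message instead of the terse "0x7" error.
When you reduce the threads per block to 512, the product is 64K, which works.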
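The arithmetic above can be written out as a small host-only sketch. The names registersPerBlock, blockFits, and kRegsCc50 are hypothetical helpers introduced for illustration, and the check deliberately ignores the hardware's register allocation granularity, so treat it as a first-order estimate rather than the exact rule the driver applies. It is plain C++ and should also compile unchanged under nvcc.

```cpp
#include <cstdint>

// Registers a block needs: registers per thread times threads per block.
constexpr std::uint32_t registersPerBlock(std::uint32_t regsPerThread,
                                          std::uint32_t threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

// True when the block's register demand fits in the available registers.
// Simplification: allocation granularity is not modeled.
constexpr bool blockFits(std::uint32_t regsPerThread,
                         std::uint32_t threadsPerBlock,
                         std::uint32_t regsAvailable) {
    return registersPerBlock(regsPerThread, threadsPerBlock) <= regsAvailable;
}

// Limit quoted in the answer for a cc 5.0 GPU: 64K registers.
constexpr std::uint32_t kRegsCc50 = 64 * 1024;

// The three cases discussed in this answer, checked at compile time.
static_assert(!blockFits(128, 1024, kRegsCc50), "128 * 1024 exceeds 64K");
static_assert(blockFits(128, 512, kRegsCc50), "128 * 512 is exactly 64K");
static_assert(blockFits(63, 1024, kRegsCc50), "63 * 1024 is below 64K");
```

Plugging in the numbers from the ptxas output, blockFits(128, 1024, kRegsCc50) is false while blockFits(128, 512, kRegsCc50) is true, which matches the observed behaviour.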
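If you search on this topic ("CUDA too many registers per thread"), you will find many treatments that give essentially the same explanation I have described above.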
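As a sketch of what "proper CUDA error checking" can look like, the following wraps runtime calls in a macro and checks the kernel launch explicitly. CUDA_CHECK and emptyKernel are hypothetical names chosen for this illustration; only cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are actual runtime API calls. With a check like this after the launch in the question, the failure would be reported as readable text rather than as a bare 0x7.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print a readable message and exit on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,   \
                         __LINE__, cudaGetErrorString(err_));             \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Placeholder kernel, only here so there is something to launch.
__global__ void emptyKernel() {}

int main() {
    emptyKernel<<<1, 32>>>();
    CUDA_CHECK(cudaGetLastError());       // catches launch-time failures
    CUDA_CHECK(cudaDeviceSynchronize());  // catches failures during execution
    return 0;
}
```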
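The typical ways to address this limit the GPU compiler's register usage. The best approach is to use launch bounds. A cruder approach is the -maxrregcount switch to the nvcc compiler. Specifying -maxrregcount is a selectable option in Nsight Eclipse Edition.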
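A minimal sketch of the launch-bounds approach, using a hypothetical kernel (boundedKernel is not the code from the question): the __launch_bounds__(1024) qualifier tells the compiler that the kernel may be launched with up to 1024 threads per block, so it has to keep per-thread register usage low enough for such a block to fit, spilling to local memory if necessary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the qualifier promises launches of at most 1024
// threads per block, which constrains the compiler's register allocation.
__global__ void __launch_bounds__(1024)
boundedKernel(unsigned int *out, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = gridDim.x * blockDim.x;
    for (; tid < n; tid += stride) {
        out[tid] = static_cast<unsigned int>(tid);
    }
}

int main() {
    const int n = 4096;
    unsigned int *dev_out = nullptr;
    if (cudaMalloc(&dev_out, n * sizeof(unsigned int)) != cudaSuccess) return 1;

    boundedKernel<<<4, 1024>>>(dev_out, n);
    cudaError_t err = cudaGetLastError();  // launch-time errors show up here
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) std::printf("%s\n", cudaGetErrorString(err));

    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Compared with -maxrregcount, which applies to every kernel in the compilation unit, launch bounds are set per kernel, which is one reason they are usually the preferred option.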
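As a simple demonstration that this is not due to a hard limit of 512 threads per block (it is not), you could set -maxrregcount to 63, and this code should then launch correctly.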
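Another way to see that the limit is specific to the compiled kernel rather than to the device is to ask the runtime directly. The sketch below, which assumes a hypothetical kernel demoKernel, uses cudaFuncGetAttributes to read the register count and the largest block size with which that particular kernel can be launched; for the build in the question, the second value would be expected to be below the device-wide 1024.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used only as a target for the attribute query.
__global__ void demoKernel(float *data, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) data[tid] *= 2.0f;
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demoKernel);
    if (err != cudaSuccess) {
        std::printf("%s\n", cudaGetErrorString(err));
        return 1;
    }
    // numRegs: registers per thread chosen by the compiler.
    // maxThreadsPerBlock: largest block size this kernel can be launched with
    // on the current device, given its resource usage.
    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("max threads per block for this kernel: %d\n",
                attr.maxThreadsPerBlock);
    return 0;
}
```

Using attr.maxThreadsPerBlock instead of prop.maxThreadsPerBlock when choosing the block size would avoid the failed launch without changing compiler settings.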
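Also note that you are building a debug project, and compiler differences between debug and release builds may affect register usage per thread, among many other factors.
It is unrelated to your question, but you may want to be aware that building a debug project is not recommended if you want the highest performance.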