CUDA - atomicAdd only adds up to 16777216

use*_*789 1 cuda

When running the following kernel, which does nothing but atomicAdds of floats, I have an easily reproducible problem:

#define OUT_ITERATIONS 20000000
#define BLOCKS 12
#define THREADS 192

__global__ void testKernel(float* result) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float bias = 1.0f;
    int n = 1;

    while (i < OUT_ITERATIONS) {
        atomicAdd(result, bias);
        i += BLOCKS * THREADS;
    }
}

The kernel should increment the result OUT_ITERATIONS times, i.e. 20M. I invoke the kernel with this standard code:

int main() {
    cudaError_t cudaStatus;
    float* result;
    float* dev_result;

    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    result = new float;
    cudaStatus = cudaMalloc((void**)&dev_result, sizeof(float));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    cudaStatus = cudaMemset(dev_result, 0, sizeof(float));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    testKernel<<<BLOCKS, THREADS>>>(dev_result);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "testKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching testKernel!\n", cudaStatus);
        goto Error;
    }

    cudaStatus = cudaMemcpy(result, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    printf("Result: %f\n", *result);

Error:
    cudaFree(dev_result);
    return 0;
}

However, the result printed at the end is 16777216.0, which is, not coincidentally, 0x1000000 in hexadecimal. The problem does not occur if OUT_ITERATIONS < 16777216; for example, if I change it to 16777000, the output is indeed 16777000.0!

System: NVIDIA Titan, CUDA 5.5, Windows 7

kan*_*yin 7

This problem is due to the limited precision of the float type.

A float has only 24 bits of binary (mantissa) precision. If you add two numbers where one is more than 2^24 times larger than the other, the result is exactly the same as the larger number.

When you add a small number like 1.0 to a large number like 16777216.0 (= 2^24), you lose precision and the result is still 16777216.0. The same thing happens in a standard C program:

float a=16777216.0f;
float b=1.0f;
printf("%f\n",a+b);

You can replace float with double or int to work around this problem.

Please refer to the CUDA docs for how to implement a double version of atomicAdd():

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
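For reference, the double-precision atomicAdd() sketched in that section of the programming guide is built on atomicCAS(). Roughly (device code only; renamed here to atomicAddDouble to avoid clashing with the built-in on GPUs that provide one natively):

```cuda
__device__ double atomicAddDouble(double* address, double val) {
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it.
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Try to swap in (assumed + val); if another thread changed the value
        // in the meantime, atomicCAS returns the new value and we retry.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                                             __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

With a double accumulator (53-bit mantissa), counts up to 2^53 increment exactly, far beyond the 20M iterations here.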

  • @Archaea I think the code in your link and in mine provide the same functionality: atomically adding a value to existing double data. Both use atomicCAS() to implement it. The implementation in my link seems more efficient. (2 upvotes)