CUDA字节原子操作只导致一个线程动作

Ama*_*aed 0 cuda atomic gpu-shared-memory

我正在编写一个 CUDA 程序,它在共享内存中定义了一个数组。我需要做的是只允许一个线程写入此数组中的每个索引,即到达此写入指令的第一个线程应更改其值,但同一扭曲或下一个扭曲中的任何其他线程应读取写入的值。

这是代码片段:

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
    printf("copy seq_shared seq_1_index = %d,  block = %d \n", seq_1_index, blockIdx.x);
}
Run Code Online (Sandbox Code Playgroud)

现在发生的情况是,warp 中的所有线程都执行这些确切的指令序列,因此 if 条件中的剩余代码被执行了 32 次。我只需要执行一次。

我怎样才能做到这一点?

Rob*_*lla 5

您可以用于atomicCAS()此用途。它执行原子比较和交换操作。

该函数将测试一个变量,如果它匹配某个条件(例如,假),它将用另一个值(例如,真)替换它。它将以原子方式完成所有这些事情,即没有中断的可能性。

在这种情况下,原子函数的返回值为我们提供了有用的信息。如果上面的例子返回值为 false,那么我们可以确定它被替换为 true。我们还可以确定我们是遇到这种情况的“第一个”线程,并且执行类似操作的所有其他线程将返回 true,而不是 false。

这是一个有效的例子:

$ cat t327.cu
#include <stdio.h>

__global__ void k(){

  __shared__ int flag;
  if (threadIdx.x == 0) flag = 0;
  __syncthreads();

  int retval = atomicCAS(&flag, 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
  // could do if statement on retval here
}


int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Run Code Online (Sandbox Code Playgroud)

回答评论中的问题,我们可以char通过创建根据编程指南double atomicAdd()中给出的函数建模的任意原子操作来将其扩展为大小标志。基本思想是,我们将使用支持的数据大小(例如 )执行atomicCAS,并且我们将转换所需的操作以有效支持大小。这是通过将地址转换为适当对齐的地址,然后对数量进行移位以排列在值中适当的字节位置来完成的。unsignedcharcharunsignedcharunsigned

这是一个有效的例子:

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned mask = 0xFFU;
  mask <<= al_offset;
  mask = ~mask;
  unsigned sval = val;
  sval <<= al_offset;
  unsigned old = *al_addr, assumed, setval;
  do {
        assumed = old;
        setval = assumed & mask;
        setval |= sval;
        old = atomicCAS(al_addr, assumed, setval);
    } while (assumed != old);
  return (char) ((assumed >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Run Code Online (Sandbox Code Playgroud)

上面给出了一般的atomicCAS尺寸char。这将允许您将任何值交换char为任何其他char值。在您的具体情况下,如果您只需要有效的布尔标志,您可以使用atomicOr评论中已经提到的方法使此操作更加有效。使用atomicOr可以让您消除上面自定义原子函数中的循环。这是一个有效的例子:

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned my_bit = 1U << al_offset;
  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Run Code Online (Sandbox Code Playgroud)

这些char原子方法假设您分配了一个大小为 4 倍数的数组。例如,char使用大小为 3 的数组(且只有 3 个线程)来执行此操作是无效的。char