How to set bits of a bit vector efficiently in CUDA?

Tags: c++, algorithm, parallel-processing, cuda, bit-manipulation

The task is like How to set bits of a bit vector efficiently in parallel?, but for CUDA.

Consider a bit vector of N bits (N is large, e.g. 4G) and an array of M numbers (M is also large, e.g. 1G), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The bit vector is just an array of integers, specifically uint32_t.

I've tried a naive implementation with atomicOr() on global memory:

__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
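  // Grid-stride loop: thread i_thread handles indices i_thread, i_thread + n_threads, ...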
  for(uint32_t i=i_thread; i<n_indices; i +=n_threads) {
    const uint32_t index = indices[i];
    assert(index < n_bits);
    const uint32_t i_word = index >> 5;
    const uint32_t i_bit = index & 31;
    atomicOr(bitset+i_word, 1u<<(i_bit));
  }
}
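
For reference, here is a minimal driver sketch showing how such a kernel can be launched (this is not the original benchmark harness: the function name, launch configuration, and the missing error checking are all simplifying assumptions):

#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical driver: clears the bit vector, then launches the kernel.
void RunBitwiseSet(const uint32_t *d_indices, uint32_t n_indices,
                   uint32_t n_bits, uint32_t *d_bitset)
{
  const uint32_t n_words = n_bits / 32 + (n_bits % 32 != 0);
  cudaMemset(d_bitset, 0, size_t(n_words) * sizeof(uint32_t));  // all bits to 0
  // Any grid large enough to saturate the GPU works: the grid-stride loop
  // inside the kernel covers the remaining indices.
  BitwiseSet<<<1024, 256>>>(n_indices, d_indices, n_bits, d_bitset);
  cudaDeviceSynchronize();
}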

And it produces interesting results for 4G bits and 1G indices:

  • RTX3090: 0.0383266 sec. for sorted indices vs. 0.332674 sec. for unsorted (8.68x ratio)
  • RTX2080: 0.0564464 sec. for sorted indices vs. 1.23666 sec. for unsorted (21.91x ratio)

So it seems the devices coalesce multiple atomicOr() operations within a warp, and perhaps the L1 cache kicks in: when indices conflict (which is the case for sorted indices), the 32-bit atomic updates end up much faster than with non-conflicting indices (the unsorted case).

Can we further improve the sorted or unsorted case?
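
One direction would be to make the warp-level aggregation explicit in software, so that each group of lanes targeting the same word issues a single atomic. Below is an untested sketch (the kernel name is mine); it assumes compute capability 7.0+ for __match_any_sync(), and whether it can beat the implicit hardware coalescing is exactly what would have to be measured:

__global__ void BitwiseSetAggregated(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  const uint32_t lane = threadIdx.x & 31;
  for(uint32_t i=i_thread; i<n_indices; i+=n_threads) {
    const uint32_t index = indices[i];
    assert(index < n_bits);
    const uint32_t i_word = index >> 5;
    const uint32_t my_mask = 1u << (index & 31);
    // Opportunistically group the active lanes of this warp by target word.
    const uint32_t active = __activemask();
    const uint32_t peers = __match_any_sync(active, i_word);
    // OR together the masks of all peers. Every active lane runs the same
    // 32-iteration loop, so the shuffles stay converged; values shuffled
    // from lanes outside `peers` are discarded.
    uint32_t combined = 0;
    for(int src=0; src<32; ++src) {
      const uint32_t v = __shfl_sync(active, my_mask, src);
      if(peers & (1u << src))
        combined |= v;
    }
    // Only the lowest lane of each group issues the atomic.
    if(lane == __ffs(peers) - 1u)
      atomicOr(bitset + i_word, combined);
  }
}

For sorted input most lanes of a warp share a word, so up to 32 atomics collapse into one; for uniformly random indices over 4G bits the groups are almost always singletons and the extra shuffles are pure overhead.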

UPDATE: answering the comments: any solution is of interest, whether for the sorted or the unsorted case, with or without repetitions. Unsorted with repetitions is the most generic case, so it would be of the most interest.
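
If preprocessing is acceptable, one option for the unsorted case is to sort the indices first, moving the workload into the fast regime measured above; whether the sort amortizes depends on M. A sketch using Thrust (the wrapper name is mine, and note that the sort mutates the index array):

#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Hypothetical wrapper: d_indices and d_bitset are device pointers.
void SetBitsViaSort(uint32_t n_indices, uint32_t *d_indices,
                    uint32_t n_bits, uint32_t *d_bitset)
{
  thrust::device_ptr<uint32_t> p(d_indices);
  thrust::sort(p, p + n_indices);  // radix sort of the 32-bit keys, in place
  BitwiseSet<<<1024, 256>>>(n_indices, d_indices, n_bits, d_bitset);
}

Duplicates are harmless for atomicOr(), and after sorting they land in the same warp, where they coalesce.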

UPDATE2: following the suggestion to vectorize memory accesses, I implemented this:

__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  const uint32_t n_vectors = n_indices / 4;
  for(uint32_t i=i_thread; i<n_vectors; i +=n_threads) {
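    // The uint4 load below assumes `indices` is 16-byte aligned; pointers
    // returned by cudaMalloc() are, but arbitrary offsets into the array are not.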
    const uint4 v_index = reinterpret_cast<const uint4*>(indices)[i];
    assert(v_index.x < n_bits);
    assert(v_index.y < n_bits);
    assert(v_index.z < n_bits);
    assert(v_index.w < n_bits);
    uint4 vi_word, vi_bit;
    vi_word.x = v_index.x >> 5;
    vi_word.y = v_index.y >> 5;
    vi_word.z = v_index.z >> 5;
    vi_word.w = v_index.w >> 5;
    vi_bit.x = v_index.x & 31;
    vi_bit.y = v_index.y & 31;
    vi_bit.z = v_index.z & 31;
    vi_bit.w = v_index.w & 31;
    atomicOr(bitset+vi_word.x, 1u<<vi_bit.x);
    atomicOr(bitset+vi_word.y, 1u<<vi_bit.y);
    atomicOr(bitset+vi_word.z, 1u<<vi_bit.z);
    atomicOr(bitset+vi_word.w, 1u<<vi_bit.w);
  }
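  // Process the up-to-3 indices left over when n_indices is not a multiple of 4.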
  if(i_thread < 4) {
    const uint32_t tail_start = n_vectors*4;
    const uint32_t tail_len = n_indices - tail_start;
    if(i_thread < tail_len) {
      const uint32_t index = indices[tail_start+i_thread];
      assert(index < n_bits);
      const uint32_t i_word = index >> 5;
      const uint32_t i_bit = index & 31;
      atomicOr(bitset+i_word, 1u<<i_bit);
    }
  }
}

But at least on the RTX2080 it's slower (I don't have the RTX3090 eGPU with me right now to test):

  • RTX2080: 0.0815998 sec. for sorted vs. 1.39829 sec. for unsorted (17.14x ratio)