Device synchronization during asynchronous memcpy in CUDA

Ste*_*ini 4 asynchronous cuda memcpy

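Suppose I want to perform an asynchronous host-to-device memcpy in CUDA and then immediately run a kernel. How can I test, from inside the kernel, whether the asynchronous transfer has completed?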

Jar*_*ock 8

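Sequencing the asynchronous copy and the kernel launch on the same CUDA "stream" ensures that the kernel executes only after the asynchronous transfer has completed. The following code example demonstrates: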

#include <stdio.h>

__global__ void kernel(const int *ptr)
{
  printf("Hello, %d\n", *ptr);
}

int main()
{
  int *h_ptr = 0;

  // allocate pinned host memory with cudaMallocHost
  // pinned memory is required for asynchronous copy
  cudaMallocHost(&h_ptr, sizeof(int));

  // look for thirteen in the output
  *h_ptr = 13;

  // allocate device memory
  int *d_ptr = 0;
  cudaMalloc(&d_ptr, sizeof(int));

  // create a stream
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // sequence the asynchronous copy on our stream
  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

  // sequence the kernel on our stream after the copy
  // the kernel will execute after the copy has completed
  kernel<<<1,1,0,stream>>>(d_ptr);

  // wait for the copy and the kernel to finish before cleaning up
  cudaStreamSynchronize(stream);

  // clean up after ourselves
  cudaStreamDestroy(stream);
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);
}

And the output:

$ nvcc -arch=sm_20 async.cu -run
Hello, 13

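I don't believe there is any supported way to test, from within a kernel, whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently of other threads of execution.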
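Although a kernel cannot test for completion itself, the host can. The following is a minimal sketch, not part of the original answer, and the kernel `increment` and the helper `copy_increment_and_poll` are hypothetical names chosen for illustration. It records a CUDA event after the work queued on the stream and polls it from the host with `cudaEventQuery`, which returns `cudaErrorNotReady` until all work that preceded the event in the stream has finished.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: increments the value in place.
__global__ void increment(int *ptr)
{
  *ptr += 1;
}

// Hypothetical helper: copies `value` to the device, runs the kernel and a
// copy back on the same stream, then polls an event from the host.
// Returns the value read back, or -1 if a CUDA call reports an error.
int copy_increment_and_poll(int value)
{
  int *h_ptr = 0;
  int *d_ptr = 0;

  // pinned host memory is required for a truly asynchronous copy
  if (cudaMallocHost(&h_ptr, sizeof(int)) != cudaSuccess) return -1;
  if (cudaMalloc(&d_ptr, sizeof(int)) != cudaSuccess)
  {
    cudaFreeHost(h_ptr);
    return -1;
  }

  cudaStream_t stream;
  cudaEvent_t done;
  cudaStreamCreate(&stream);
  cudaEventCreate(&done);

  *h_ptr = value;

  // all three operations are ordered because they share one stream
  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);
  increment<<<1,1,0,stream>>>(d_ptr);
  cudaMemcpyAsync(h_ptr, d_ptr, sizeof(int), cudaMemcpyDeviceToHost, stream);

  // the event completes once everything queued before it has completed
  cudaEventRecord(done, stream);

  // poll from the host; other host-side work could be done in this loop
  cudaError_t status;
  while ((status = cudaEventQuery(done)) == cudaErrorNotReady)
  {
  }

  int result = (status == cudaSuccess) ? *h_ptr : -1;

  // clean up after ourselves
  cudaEventDestroy(done);
  cudaStreamDestroy(stream);
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);

  return result;
}
```

On a machine with a working CUDA device, calling `copy_increment_and_poll(13)` from `main` should return 14. If the host has nothing else to do while waiting, `cudaStreamSynchronize(stream)` or `cudaEventSynchronize(done)` is simpler than a polling loop.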