Has cudaMalloc become asynchronous?

Question

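I have read elsewhere that cudaMalloc synchronizes across kernels (see, for example, "Will cudaMalloc synchronize host and device?"). However, I just tested the code below, and judging by what I see in the Visual Profiler, cudaMalloc does not appear to synchronize. If I add cudaFree to the loop, it does synchronize. I am using CUDA 7.5. Does anyone know whether cudaMalloc has changed its behavior, or am I missing some subtlety? Thanks very much!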

__global__ void slowKernel()
{
  float input = 5;
  for( int i = 0; i < 1000000; i++ ){
    input = input * .9999999;
  }
}

__global__ void fastKernel()
{
  float input = 5;
  for( int i = 0; i < 100000; i++ ){
    input = input * .9999999;
  }
}

void mallocSynchronize(){
  cudaStream_t stream1, stream2;
  cudaStreamCreate( &stream1 );
  cudaStreamCreate( &stream2 );
  slowKernel <<<1, 1, 0, stream1 >>>();
  int *dev_a = 0;
  for( int i = 0; i < 10; i++ ){
    cudaMalloc( &dev_a, 4 * 1024 * 1024 );
    fastKernel <<<1, 1, 0, stream2 >>>();
    // cudaFree( dev_a ); // If you uncomment this, the second fastKernel launch will wait until slowKernel completes
  }
}

Answer 1

tal*_*ies 1

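Your methodology is flawed, but your conclusion looks correct to me. (If you look at your profile data, you should see that the long and short kernels both take the same amount of time and run extremely quickly, because aggressive compiler optimization eliminates all of the code in both cases.)

I turned your example into something more reasonable: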

#include <time.h>
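// The conditional write depends on a runtime argument, so the compiler
// cannot eliminate the loop as dead code.
__global__ void slowKernel(float *output, bool write=false)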
{
    float input = 5;
#pragma unroll
    for( int i = 0; i < 10000000; i++ ){
        input = input * .9999999;
    }
    if (write) *output -= input;
}

__global__ void fastKernel(float *output, bool write=false)
{
    float input = 5;
#pragma unroll
    for( int i = 0; i < 100000; i++ ){
        input = input * .9999999;
    }
    if (write) *output -= input;
}

void burntime(long val) {
    struct timespec tv[] = {{0, val}};
    nanosleep(tv, 0);
}

void mallocSynchronize(){
    cudaStream_t stream1, stream2;
    cudaStreamCreate( &stream1 );
    cudaStreamCreate( &stream2 );
    const size_t sz = 1 << 21;
    slowKernel <<<1, 1, 0, stream1 >>>((float *)(0));
    burntime(500000000L); // 500ms wait - slowKernel around 1300ms
    int *dev_a = 0;
    for( int i = 0; i < 10; i++ ){
        cudaMalloc( &dev_a, sz );
        fastKernel <<<1, 1, 0, stream2 >>>((float *)(0));
        burntime(1000000L); // 1ms wait - fastKernel around 15ms
    }
}

int main()
{
    mallocSynchronize();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

[Note: this requires POSIX time functions, so it will not run on Windows.]

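On a reasonably fast Maxwell device (GTX 970), the profile trace shows the cudaMalloc calls in the loop overlapping with the still-executing slowKernel call, and then with the fastKernel calls running in the other stream. I was willing to accept the initial conclusion that minute timing variations could cause the effect you saw in the broken example. With this code, however, a 0.5 second time shift in synchronization between the host and device traces seems very improbable. You might need to vary the duration of the burntime calls to get the same effect, depending on the speed of your GPU.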

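So this is a very long way of saying: yes, it looks like a non-synchronizing call on Linux with CUDA 7.5 and a Maxwell device. I don't believe that was always the case, but then again, as far as I can tell, the documentation has never said whether it should block/synchronize or not. I don't have access to older CUDA versions and supported hardware to see how this example behaves with an older driver and a Fermi or Kepler device.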
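
If you would rather not rely on reading the profiler timeline, the same question can be checked from the host. The following is only a minimal sketch and not part of the test above: spinKernel and its cycle count are hypothetical names and values chosen for illustration, and the result will depend on your driver, platform and GPU. The idea is to launch a kernel that busy-waits on clock64() (so it cannot be optimized away), call cudaMalloc, and then immediately ask cudaStreamQuery whether the kernel is still pending.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: busy-wait for roughly `cycles` GPU clock ticks.
// Reading clock64() on every iteration prevents the loop from being removed.
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Assumption: about one second on a ~1 GHz GPU clock; tune for your device.
    spinKernel<<<1, 1, 0, stream>>>(1000000000LL);

    int *dev_a = 0;
    cudaError_t allocStatus = cudaMalloc(&dev_a, 1 << 21);
    if (allocStatus != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(allocStatus));
        return 1;
    }

    // cudaStreamQuery returns cudaErrorNotReady while work in the stream is
    // still pending, and cudaSuccess once everything has completed.
    cudaError_t queryStatus = cudaStreamQuery(stream);
    if (queryStatus == cudaErrorNotReady) {
        printf("cudaMalloc returned while the kernel was still pending\n");
    } else if (queryStatus == cudaSuccess) {
        printf("kernel had already finished when cudaMalloc returned "
               "(cudaMalloc blocked, or the kernel was too short)\n");
    } else {
        printf("unexpected error: %s\n", cudaGetErrorString(queryStatus));
    }

    cudaStreamSynchronize(stream);
    cudaFree(dev_a);
    cudaStreamDestroy(stream);
    return 0;
}
```

A cudaErrorNotReady result means the allocation did not wait for the kernel on that particular setup. A cudaSuccess result is inconclusive on its own, since the kernel may simply have been too short, so lengthen the spin before drawing a conclusion.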
