What is the difference between mapped memory and managed memory?

Qua*_*ock 4 cuda nvidia gpu-managed-memory

I have been reading about the various approaches to memory management offered by CUDA, and I'm struggling to understand the difference between mapped memory:

int *foo;
std::size_t size = 32;
cudaHostAlloc(&foo, size, cudaHostAllocMapped);
Run Code Online (Sandbox Code Playgroud)

...and managed memory:

int *foo;
std::size_t size = 32;
cudaMallocManaged(&foo, size);
Run Code Online (Sandbox Code Playgroud)

They both appear to implicitly transfer memory between the host and device. cudaMallocManaged seems to be the newer API, and it uses the so-called "Unified Memory" system. That said, cudaHostAlloc seems to share many of these properties on 64-bit systems thanks to the unified virtual address space.

There seem to be a few other differences in documentation, but I am not confident that the absence of explicit feature documentation will lead me to a correct understanding of the differences between these two functions (e.g. I don't believe it is explicitly stated that cudaMallocManaged's host memory is page-locked, but I suspect that it is).

They also correspond to different functions in the driver API (cuMemHostAlloc and cuMemAllocManaged), which I think is a good indicator that their behavior differs in some meaningful way.

小智 5

我认为主要区别在于分页/页面错误机制。

固定内存的作用与普通设备内存相同。如果请求一字节的固定内存,则一字节将通过 PCIe 总线透明地传输到 GPU。(也许驱动程序合并了连续内存位置的请求,我不知道。)

另一方面,托管内存具有内存页的访问粒度。如果设备上不存在所请求字节的页面,则不仅是单个字节,而是整个页面(在许多系统上为 4096 字节)都会从其当前位置迁移到 GPU,当前位置可以是主机内存,也可以是设备内存。另一个 GPU。

以下程序尝试显示不同的行为。分配了 256 MB,相当于 64 * 1024 页,大小为 4096 字节。然后,在内核中,每个线程访问每个页面的第一个字节,即每个第 4096 个字节。该时间针对固定内存、托管内存和普通设备内存进行测量。

#include <iostream>
#include <cassert>

__global__
void kernel(char* __restrict__ data, int pagesize, int numpages){
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if(tid < numpages){
        data[tid * pagesize] += 1;
    }
}

int main(){
    const int pagesize = 4096;
    const int numpages = 1024 * 64;
    const int bytes = pagesize * numpages;
    cudaError_t status = cudaSuccess;
    float elapsed = 0.0f;
    const int iterations = 5;

    char* devicedata; 
    status = cudaMalloc(&devicedata, bytes);
    assert(status == cudaSuccess);

    char* pinneddata; 
    status = cudaMallocHost(&pinneddata, bytes);
    assert(status == cudaSuccess);

    char* manageddata;
    status = cudaMallocManaged(&manageddata, bytes);
    assert(status == cudaSuccess);

    status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
    //status = cudaMemPrefetchAsync(manageddata, bytes, 0);
    assert(status == cudaSuccess);

    cudaEvent_t event1, event2;
    cudaEventCreate(&event1);
    cudaEventCreate(&event2);

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(pinneddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);
        
        float bandwith = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "pinned: " << elapsed << ", throughput " << bandwith << " GB/s" << "\n";
    }

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(manageddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);

        float bandwith = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "managed: " << elapsed << ", throughput " << bandwith << " MB/s" << "\n";

        status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
        assert(status == cudaSuccess);     
    }

    for(int iteration = 0; iteration < iterations; iteration++){
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(devicedata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);
        
        float bandwith = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "device: " << elapsed << ", throughput " << bandwith << " MB/s" << "\n";
    }

    cudaFreeHost(pinneddata);
    cudaFree(manageddata);
    cudaFree(devicedata);
    cudaEventDestroy(event1);
    cudaEventDestroy(event2);

}
Run Code Online (Sandbox Code Playgroud)

当托管内存预取到主机时,会观察到以下时间

pinned: 1.4577 ms, throughput 42.8759 MB/s
pinned: 1.4927 ms, throughput 41.8703 MB/s
pinned: 1.44947 ms, throughput 43.1192 MB/s
pinned: 1.44371 ms, throughput 43.2912 MB/s
pinned: 1.4496 ms, throughput 43.1153 MB/s
managed: 40.3646 ms, throughput 1.54839 MB/s
managed: 35.8052 ms, throughput 1.74555 MB/s
managed: 36.7788 ms, throughput 1.69935 MB/s
managed: 37.3166 ms, throughput 1.67486 MB/s
managed: 35.3378 ms, throughput 1.76864 MB/s
device: 0.052256 ms, throughput 1196.03 MB/s
device: 0.061312 ms, throughput 1019.38 MB/s
device: 0.060736 ms, throughput 1029.04 MB/s
device: 0.060096 ms, throughput 1040 MB/s
device: 0.060352 ms, throughput 1035.59 MB/s
Run Code Online (Sandbox Code Playgroud)

nvprof 确认,在托管内存的情况下,所有 256 MB 都会传输到设备。

==27443== Unified Memory profiling result:
Device "TITAN Xp (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    6734  38.928KB  4.0000KB  776.00KB  256.0000MB  29.95677ms  Host To Device
Run Code Online (Sandbox Code Playgroud)

当我们删除循环内的预取时,迁移的页面保留在 GPU 上,这将访问时间缩短到正常设备内存的水平。

pinned: 1.46848 ms, throughput 42.561 MB/s
pinned: 1.50842 ms, throughput 41.4342 MB/s
pinned: 1.44285 ms, throughput 43.3171 MB/s
pinned: 1.45802 ms, throughput 42.8665 MB/s
pinned: 1.4431 ms, throughput 43.3094 MB/s
managed: 41.9972 ms, throughput 1.4882 MB/s  <--- need to migrate pages
managed: 0.047584 ms, throughput 1313.47 MB/s <--- pages already present on GPU
managed: 0.059552 ms, throughput 1049.5 MB/s
managed: 0.057248 ms, throughput 1091.74 MB/s
managed: 0.062336 ms, throughput 1002.63 MB/s
device: 0.06176 ms, throughput 1011.98 MB/s
device: 0.062592 ms, throughput 998.53 MB/s
device: 0.062176 ms, throughput 1005.21 MB/s
device: 0.06128 ms, throughput 1019.91 MB/s
device: 0.063008 ms, throughput 991.937 MB/s
Run Code Online (Sandbox Code Playgroud)