What is the difference between the __ldg() intrinsic and a normal load?

san*_*age 6 cuda

I am trying to explore the `__ldg` intrinsic. I have gone through NVIDIA's documentation, but did not get any satisfactory answer about its use and implementation. Moreover, with reference to this, I tried implementing `__ldg` in a simple 1024*1024 matrix multiplication example.

#include<stdio.h>
#include<stdlib.h>

__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
        float pvalue=0;
        //find Row and Column corresponding to a data element for each thread
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        //calculate dot product of Row of First Matrix and Column of Second Matrix
        for(int i=0;i< N;++i)
        {
//   I tried with executing this first:
            float m=__ldg(&ad[Row * N+i]);
            float n=__ldg(&bd[i * N + Col]);

//Then I executed this as a normal execution:
//          float m = ad[Row * N+i];
//          float n = bd[i * N + Col];

            pvalue += m * n;
         }
        //store dot product at corresponding position in resultant Matrix
        cd[Row * N + Col] = pvalue;
}

int main()
{
    int N = 1024,i,j;               //N == size of square matrix

    float *a,*b;
    float *ad,*bd,*cd,*c;

    //open a file for outputting the result
    FILE *f;
    f=fopen("Parallel Multiply_ldg.txt","w");

    size_t size=sizeof(float)* N * N;

    //allocate host side memory
    a=(float*)malloc(size);
    b=(float*)malloc(size);
    c=(float*)malloc(size);

    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i*N+j]=2.0;   //(float)(i*N+j);       //initializing each value with its own index
            b[i*N+j]=1.0;   //(float)(i*N+j);       //random functions can be used alternatively
        }
    }

    //allocate device memory
    cudaMalloc(&ad,size);
    //printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&bd,size);
    //printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
    cudaMalloc(&cd,size);
    //printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //copy value from host to device
    cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
    cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);

    printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //calculate execution configuration
    dim3 blocksize(16,16);              //each block contains 16 * 16 (=256) threads
    dim3 gridsize(N/16,N/16);           //creating just sufficient no of blocks

    //GPU timer code
    float time;
    cudaEvent_t start,stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);

    matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
    cudaDeviceSynchronize();
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time,start,stop);         //time taken in kernel call calculated
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    //copy back results
    cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);

    printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));

    //output results in output_file
    fprintf(f,"Array A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",a[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nArray B was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",b[i*N+j]);
        fprintf(f,"\n");
    }
    fprintf(f,"\nMultiplication of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            fprintf(f,"%f ",c[i*N+j]);              //if correctly computed, then all values must be N
        fprintf(f,"\n");
    }
    printf("\nYou can see the output in Parallel Multiply_ldg.txt in the project directory");
    printf("\n\nTime taken is %f (ms)\n",time);
    fprintf(f,"\n\nTime taken is %f (ms)\n",time);
    fclose(f);

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    free(a);free(b);free(c);
    cudaDeviceReset();      //cudaThreadExit() is deprecated
    return 0;
}
}

I first ran the kernel with the `__ldg` loads, then commented them out and ran it with the normal loads, and vice versa. In both cases it produces the correct multiplication result. What confuses me is the time difference between the two executions, because it is huge: almost more than 100x!

With `__ldg` it gives me: Time taken is 0.014432 (ms)

With the normal loads, without `__ldg`, it gives me: Time taken is 36.858398 (ms)

Is this the right way to use the `__ldg` intrinsic? What is the significance of the `__ldg` intrinsic, and what is the proper way of using it? Apparently what I did in the code above is wrong and naive. I am looking for an explanation and examples. Thanks in advance.

Avi*_*urg 11

From the CUDA C Programming Guide:

Global memory accesses for devices of compute capability 3.x are cached in L2 and, for devices of compute capability 3.5, may also be cached in the read-only data cache described in the previous section; they are not cached in L1.

...

Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the `__ldg()` function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use `__ldg()` to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the `const` and `__restrict__` qualifiers increases the likelihood that the compiler will detect the read-only condition.

Read-only cache accesses have a much lower latency than regular global memory accesses. Because matrix multiplication reads the same values from memory many times, caching them in the read-only cache gives a big speedup (in memory-bound applications).
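As a sketch of the alternative the quoted passage describes (not code from the question), the same kernel can be written without calling `__ldg()` by hand: marking the input pointers `const __restrict__` lets the compiler prove the loads are read-only for the kernel's lifetime and emit the LDG instruction itself.

```cuda
// Variant of the question's kernel, assuming the same 16x16 block /
// (N/16 x N/16) grid launch configuration. The const __restrict__
// qualifiers on ad and bd tell the compiler these buffers are
// read-only and not aliased, so it may route the loads through the
// read-only data cache automatically (on compute capability >= 3.5).
__global__ void matrix_mul_restrict(const float* __restrict__ ad,
                                    const float* __restrict__ bd,
                                    float*       __restrict__ cd,
                                    int N)
{
    float pvalue = 0.0f;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < N; ++i)
        pvalue += ad[Row * N + i] * bd[i * N + Col];  // eligible for LDG
    cd[Row * N + Col] = pvalue;
}
```

It is launched exactly like the original: `matrix_mul_restrict<<<gridsize, blocksize>>>(ad, bd, cd, N);`. This form is usually preferable to explicit `__ldg()` calls because it stays correct and portable even on devices without a read-only data cache.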