我正在尝试探索'__ldg内在'.我已经通过了NVIDIA的文档,但没有得到任何关于其使用和实现的满意答案.此外,参考这个我尝试在一个简单的1024*1024矩阵乘法示例中实现__ldg.
#include<stdio.h>
#include<stdlib.h>
__global__ void matrix_mul(float * ad,float * bd,float * cd,int N)
{
float pvalue=0;
//find Row and Column corresponding to a data element for each thread
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
//calculate dot product of Row of First Matrix and Column of Second Matrix
for(int i=0;i< N;++i)
{
// I tried with executing this first:
float m=__ldg(&ad[Row * N+i]);
float n=__ldg(&bd[i * N + Col]);
//Then I executed this as a normal execution:
// float m = ad[Row * N+i];
// float n = bd[i * N + Col];
pvalue += m * n;
}
//store dot product at corresponding position in resultant Matrix
cd[Row * N + Col] = pvalue;
}
int main()
{
int N = 1024,i,j; //N == size of square matrix
float *a,*b;
float *ad,*bd,*cd,*c;
//open a file for outputting the result
FILE *f;
f=fopen("Parallel Multiply_ldg.txt","w");
size_t size=sizeof(float)* N * N;
//allocate host side memory
a=(float*)malloc(size);
b=(float*)malloc(size);
c=(float*)malloc(size);
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
{
a[i*N+j]=2.0; //(float)(i*N+j); //initializing each value with its own index
b[i*N+j]=1.0; //(float)(i*N+j); //random functions can be used alternatively
}
}
//allocate device memory
cudaMalloc(&ad,size);
//printf("\nAfter cudaMalloc for ad\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&bd,size);
//printf("\nAfter cudaMalloc bd\n%s\n",cudaGetErrorString(cudaGetLastError()));
cudaMalloc(&cd,size);
//printf("\nAfter cudaMalloc cd\n%s\n",cudaGetErrorString(cudaGetLastError()));
//copy value from host to device
cudaMemcpy(ad,a,size,cudaMemcpyHostToDevice);
cudaMemcpy(bd,b,size,cudaMemcpyHostToDevice);
printf("\nAfter HostToDevice Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//calculate execution configuration
dim3 blocksize(16,16); //each block contains 16 * 16 (=256) threads
dim3 gridsize(N/16,N/16); //creating just sufficient no of blocks
//GPU timer code
float time;
cudaEvent_t start,stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);
matrix_mul <<< gridsize, blocksize >>> (ad,bd,cd, N);
cudaDeviceSynchronize();
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time,start,stop); //time taken in kernel call calculated
cudaEventDestroy(start);
cudaEventDestroy(stop);
//copy back results
cudaMemcpy(c,cd,sizeof(float)* N*N,cudaMemcpyDeviceToHost);
printf("\nAfter DeviceToHost Memcpy\n%s\n",cudaGetErrorString(cudaGetLastError()));
//output results in output_file
fprintf(f,"Array A was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",a[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nArray B was---\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",b[i*N+j]);
fprintf(f,"\n");
}
fprintf(f,"\nMultiplication of A and B gives C----\n");
for(i=0;i<N;i++)
{
for(j=0;j<N;j++)
fprintf(f,"%f ",c[i*N+j]); //if correctly computed, then all values must be N
fprintf(f,"\n");
}
printf("\nYou can see output in Parallel Mutiply.txt file in project directory");
printf("\n\nTime taken is %f (ms)\n",time);
fprintf(f,"\n\nTime taken is %f (ms)\n",time);
fclose(f);
cudaThreadExit();
//cudaFree(ad); cudaFree(bd); cudaFree (cd);
free(a);free(b);free(c);
//_getch();
return 1;
}
Run Code Online (Sandbox Code Playgroud)
我评论说__ldg部分在我的内核中并通过正常执行执行,反之亦然.在这两种情况下,它都给出了正确的乘法结果.我对这些执行之间的时差感到困惑,因为它的巨大差不多超过100倍!
在__ldg的情况下,它给了我: Time taken is 0.014432 (ms)
如果没有__ldg正常执行,它会给我: Time taken is 36.858398 (ms)
这是使用__ldg内在的确切方法吗?__ldg内在的重要性是什么,使用它的正确方法是什么?显然我在上面的代码中所做的是错误的和幼稚的.我正在寻找解释和例子.提前致谢.
Avi*_*urg 11
计算能力3.x的设备的全局内存访问缓存在L2中,对于计算能力3.5的设备,也可以缓存在前一节中描述的只读数据缓存中; 他们没有在L1中缓存.
...
通过使用该
__ldg()函数读取内核的整个生命周期的只读数据,也可以将其缓存在上一节中描述的只读数据高速缓存中(请参阅只读数据高速缓存加载函数).当编译器检测到某些数据满足只读条件时,它将__ldg()用于读取它.编译器可能无法始终检测到某些数据满足只读条件.标记用于使用const和__restrict__限定符加载此类数据的指针增加了编译器检测只读条件的可能性.
只读缓存访问的延迟比全局内存访问低得多.因为矩阵乘法多次从内存中访问相同的值,所以只读缓存中的缓存会带来巨大的加速(在内存绑定应用程序中).