CUDA kernel performs worse than the CPU version for the Julia set

Question

Mic*_* IV 2 c++ cuda

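I am learning CUDA from the book "CUDA by Example". Chapter 4 has a demo that generates a Julia fractal, and both a CPU and a GPU version are shown. I decided to add timing to see the execution speed in both cases and, to my surprise, the CPU version runs 3 times faster than the GPU version.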

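CPU Julia generation total time:

745 ms.

GPU Julia generation total time:

2456 ms.

So what is going on? At least from the CUDA kernel code, execution is clearly parallel, since it is distributed across a 1000x1000 grid of blocks, each computing one pixel of the 1000x1000 final image.

Here is the source code of the implementation: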

 #define N 10
 #define DIM 1000
 typedef unsigned char byte;

struct cuComplex {
   float   r;
   float   i;
   __host__ __device__ cuComplex( float a, float b ) : r(a), i(b)  {}
   __host__  __device__ float magnitude2( void ) {
          return r * r + i * i;
   }
   __host__ __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
   }
    __host__ __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
   }
};

 __device__ int juliaGPU(int x , int y){
    const float scale =1.3;
    float jx = scale * (float)(DIM/2 -x)/(DIM/2);
    float jy=  scale *(float)(DIM/2 -y)/(DIM/2);

    cuComplex c(-0.8 ,0.156);
    cuComplex a(jx ,jy);
    int i = 0;
    for(i=0; i <200;i++){
        a = a * a +c;
        if(a.magnitude2() >1000){

            return 0;
        }
    }
    return 1;

 }

 __global__ void kernelGPU(byte *ptr){
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset =x + y * gridDim.x;

    int juliaValue =juliaGPU(x , y);
    ptr[offset * 4 + 0]=255 * juliaValue;
    ptr[offset * 4 + 1]=0;
    ptr[offset * 4 + 2]=0;
    ptr[offset * 4 + 3]=255 ;
}


 struct DataBlock {
    unsigned char   *dev_bitmap;
};
void juliaGPUTestSample(){
    DataBlock   data;
    CPUBitmap bitmap(DIM,DIM);
    byte *dev_bitmap; // memory on GPU
    HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap , bitmap.image_size()) );
    data.dev_bitmap =dev_bitmap;
    dim3 grid(DIM,DIM);
    int starTime=glutGet(GLUT_ELAPSED_TIME);

    // One block per pixel, one thread per block
    kernelGPU<<<grid ,1 >>>(dev_bitmap);
    // cudaMemcpy waits for the kernel to finish, so the timing covers kernel + copy
    HANDLE_ERROR(cudaMemcpy(bitmap.get_ptr() , dev_bitmap ,bitmap.image_size() ,cudaMemcpyDeviceToHost ) );
    int endTime=glutGet(GLUT_ELAPSED_TIME)-starTime; // elapsed milliseconds
    printf("Total time: %d ms\n" ,endTime);
    HANDLE_ERROR(cudaFree(dev_bitmap));

    bitmap.display_and_exit();
}

int main(void){
    juliaGPUTestSample();
    return 0;
}


Here is the CPU version:

// The "cuComplex" struct is the same as above.

int julia (int x , int y){
    const float scale = 1.3;
    float jx = scale * (float)(DIM/2 -x)/(DIM/2);
    float jy = scale * (float)(DIM/2 -y)/(DIM/2);

    cuComplex c(-0.8 ,0.156);
    cuComplex a(jx ,jy);

    int i = 0;
    for(i=0; i <200;i++){
        a = a * a +c;
        if(a.magnitude2() >1000){
            return 0;
        }
    }

    return 1;
}

void kernel(unsigned char *ptr){
    for(int y = 0 ; y <DIM ;++y){
        for(int x = 0 ; x <DIM ; ++x){
            int offset =x + y * DIM;
            int juliaValue = julia(x , y);

            ptr[offset * 4 + 0 ] = juliaValue * 125;
            ptr[offset * 4 + 1 ] = juliaValue * x;
            ptr[offset * 4 + 2 ] = juliaValue * y;
            ptr[offset * 4 + 3 ] = 255 ;
        }
    }
}

void juliaCPUTestSample(){
    CPUBitmap bitmap(DIM ,DIM);
    unsigned char *ptr = bitmap.get_ptr();
    int starTime=glutGet(GLUT_ELAPSED_TIME);

    kernel(ptr);

    int endTime=glutGet(GLUT_ELAPSED_TIME)-starTime; // elapsed milliseconds
    printf("Total time: %d ms\n" ,endTime);
    bitmap.display_and_exit();
}


Update - system configuration:

Windows 7 64-bit

CPU - Intel i7-3770 CPU 3.40GHz, 16GB RAM

GPU - NVidia Quadro 4000

Answer 1

Rob*_*lla 9

Others have noticed this too.

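First, when discussing performance comparisons between CPU and GPU, it is best to mention the system configuration, including the hardware platform and the software. For example, I ran your code on an HP laptop with a Core i7 2.60GHz quad-core CPU and a Quadro 1000M GPU, running RHEL 6.2 and CUDA 5.0, and got a score of 438 for the GPU and 441 for the CPU.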
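As a rough sketch of how the GPU side of that configuration could be collected, the program below prints what the CUDA runtime reports about each device. The helper names `versionMajor` and `versionMinor` are made up for this example; the runtime calls themselves (`cudaRuntimeGetVersion`, `cudaGetDeviceCount`, `cudaGetDeviceProperties`) are standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers: the runtime encodes its version as 1000*major + 10*minor.
int versionMajor(int v) { return v / 1000; }
int versionMinor(int v) { return (v % 1000) / 10; }

// Prints the runtime version and basic properties of every CUDA device.
int reportCudaConfig() {
    int runtimeVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("CUDA runtime version: %d.%d\n",
           versionMajor(runtimeVersion), versionMinor(runtimeVersion));

    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        printf("Device %d: %s, compute capability %d.%d, %d SMs, warp size %d\n",
               d, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.warpSize);
    }
    return 0;
}

int main() { return reportCudaConfig(); }
```

Posting this output together with the CPU model and OS makes timing numbers from different machines easier to compare.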

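Second, and more importantly, the Julia sample in that book is a relatively early example of CUDA coding, so it is not really geared toward maximum performance; it is meant to illustrate the concepts discussed up to that point. That book, like various other CUDA tutorial materials, first introduces parallel programming with CUDA at the block level. The indication of this is the following line: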

kernelGPU<<<grid ,1 >>>(dev_bitmap);


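The kernel launch parameters <<<grid, 1>>> indicate that a grid of some number of blocks (grid, in this case 1 million blocks in total) is to be launched, with each block containing a single thread. Compared to a grid that uses a full complement of threads per thread block, this immediately cuts the usable power of, for example, a Fermi-class GPU to 1/32. Each SM in a Fermi-class GPU has 32 thread processors, all of which execute in lock step. If you launch a block with only 16 threads in it, 16 thread processors will execute code and the other 16 will do nothing (that is, nothing useful). A thread block that contains only 1 thread will therefore use only 1 of the 32 thread processors, leaving the other 31 idle.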
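For contrast, here is a minimal sketch of a launch that fills each block with threads. It assumes 16x16 threads per block, which is an arbitrary choice and not something taken from the book or the question. The names `TILE`, `juliaPoint`, `kernelGPUThreads`, and `blocksFor` are hypothetical; `juliaPoint` performs the same iteration as `juliaGPU` from the question, written with plain floats so the sketch does not need `cuComplex`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define DIM 1000
#define TILE 16   // assumed block edge: 16 x 16 = 256 threads per block

typedef unsigned char byte;

// Same iteration as juliaGPU in the question: a = a*a + c, up to 200 times.
__host__ __device__ int juliaPoint(int x, int y) {
    const float scale = 1.3f;
    float ar = scale * (float)(DIM / 2 - x) / (DIM / 2);
    float ai = scale * (float)(DIM / 2 - y) / (DIM / 2);
    const float cr = -0.8f, ci = 0.156f;
    for (int i = 0; i < 200; i++) {
        float nr = ar * ar - ai * ai + cr;  // real part of a*a + c
        float ni = 2.0f * ar * ai + ci;     // imaginary part of a*a + c
        ar = nr;
        ai = ni;
        if (ar * ar + ai * ai > 1000.0f) return 0;
    }
    return 1;
}

// One thread per pixel; the pixel is derived from block and thread indices.
__global__ void kernelGPUThreads(byte *ptr) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= DIM || y >= DIM) return;  // DIM is not a multiple of TILE
    int offset = x + y * DIM;
    int juliaValue = juliaPoint(x, y);
    ptr[offset * 4 + 0] = 255 * juliaValue;
    ptr[offset * 4 + 1] = 0;
    ptr[offset * 4 + 2] = 0;
    ptr[offset * 4 + 3] = 255;
}

// Blocks needed to cover n pixels with `threads` threads each, rounded up.
int blocksFor(int n, int threads) { return (n + threads - 1) / threads; }

int main() {
    const size_t bytes = (size_t)DIM * DIM * 4;
    byte *dev_bitmap = nullptr;
    if (cudaMalloc((void **)&dev_bitmap, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    dim3 threads(TILE, TILE);
    dim3 grid(blocksFor(DIM, TILE), blocksFor(DIM, TILE));
    kernelGPUThreads<<<grid, threads>>>(dev_bitmap);

    std::vector<byte> host(bytes);
    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev_bitmap, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_bitmap);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Launched %u x %u blocks of %d x %d threads\n",
           grid.x, grid.y, TILE, TILE);
    return 0;
}
```

The bounds check is needed because the grid is rounded up and covers slightly more than DIM pixels in each direction. How much faster this runs than the one-thread-per-block launch depends on the GPU, so it is worth timing on the actual hardware.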

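So this particular code sample is not well designed to exploit the full parallel capability of the GPU. That is understandable given how early it appears in the book's exposition of CUDA concepts; I don't believe the authors intended this code to be benchmarked or taken as a legitimate representation of how to write fast code on the GPU.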

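Given that factor of 1/32, the fact that on your system the CPU is only 3 times faster, and that on my system the CPU and GPU have comparable throughput (most likely neither of these is a particularly high-performance CUDA GPU), I think this shows the GPU in a reasonably good light. The GPU is fighting this battle with about 97% of its capability unused.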
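The 97% figure follows from the one-thread-per-block launch, assuming 32 thread processors per SM executing in lock step as described above:

```latex
\[
\text{idle fraction} \;=\; 1 - \frac{1}{32} \;=\; \frac{31}{32} \;=\; 0.96875 \;\approx\; 97\%
\]
```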

Archived:	13 years, 3 months ago
Views:	2050
Last activity:	10 years, 8 months ago