Extracting matrix columns with CUDA?

Hie*_*ham · tags: c, cuda, matrix, extraction, gpu-programming

Using nvprof, I found that the following kernel is the bottleneck of my CUDA application:

__global__ void extractColumn_kernel(real_t *tgt, real_t *src, int *indices, int numRows, int len) {
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  for (int j = tid; j < len; j += stride) {
    int colId = j / numRows;
    int rowId = j % numRows;
    tgt[j] = src[indices[colId]*numRows + rowId];
  }
}

It is intended to extract the columns of the matrix src listed in indices into the matrix tgt. Note that both matrices src and tgt have numRows rows and are stored in column-major order. Also, len = length(indices)*numRows is the total number of entries in the matrix tgt.
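In plain C, the intended operation is equivalent to something like the following reference loop (a simple sketch, assuming the column-major layout described above):

void extractColumn_host(real_t *tgt, const real_t *src, const int *indices,
                        int numRows, int numIndices) {
  // copy each requested column of src into tgt (both stored column-major)
  for (int c = 0; c < numIndices; ++c)
    for (int r = 0; r < numRows; ++r)
      tgt[c*numRows + r] = src[indices[c]*numRows + r];
}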

My question: is there a more efficient way to do this? A pointer to an older question would also be welcome. I am surprised I could not find this asked before, since it corresponds to the common MATLAB operation tgt = src(:,indices(:));.

Many thanks!

Rob*_*lla 5

Since this is effectively just a copy kernel, the best possible performance will be bounded by the available memory bandwidth. A rough estimate of that can be obtained by running the bandwidthTest CUDA sample code and using the reported device-to-device transfer number (which will vary by GPU).

Your kernel is already written fairly well, in that both the load and store operations should coalesce nicely. This is evident from inspection of the code, but you can prove it to yourself by running nvprof with --metrics gld_efficiency and also with --metrics gst_efficiency. Both numbers should be close to 100%. (According to my testing, they are.)
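For example, a possible invocation would be (assuming the executable is named ./t822, as in the test code below):

$ nvprof --metrics gld_efficiency,gst_efficiency ./t822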

So in my case, when I ran your kernel on a Quadro 5000 GPU, taking the transfer size into account and dividing by the kernel execution time, I got a number that is about 60% of the available bandwidth. There isn't much else going on in your kernel, so we focus on these two lines in the loop:

int colId = j / numRows;
int rowId = j % numRows;

It turns out that integer division and modulo are both fairly expensive on the GPU; they are implemented via instruction sequences generated by the compiler, since there are no native division or modulo machine instructions. So if we can find a way to eliminate these from the main loop, we may be able to get closer to the goal of 100% of the bandwidth reported by bandwidthTest (device-to-device).

Since the loop increment is fixed (stride), I believe we can (mostly) precompute the increments that need to be added to colId and rowId on each iteration of the loop, and use addition and subtraction inside the loop instead of division. The modified kernel looks like this:

__global__ void my_extractColumn_kernel(real_t *tgt, real_t *src, int *indices, int numRows, int len) {
  int stride = gridDim.x * blockDim.x;
  int div = stride/numRows;   // whole columns advanced per grid-stride step
  int rem = stride%numRows;   // leftover rows advanced per grid-stride step
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  int colId = tid / numRows;
  int rowId = tid % numRows;
  for (int j = tid; j < len; j += stride) {
    tgt[j] = src[indices[colId]*numRows + rowId];
    colId += div;
    rowId += rem;
    if (rowId >= numRows) {rowId-=numRows; colId++;}  // carry into the next column
  }
}

So by precomputing the per-iteration increments for the loop, we avoid the "expensive" division-type operations in the main loop. What about performance? This kernel gets closer to the 100%-bandwidth goal. Here is a fully worked test code:

#include <stdio.h>

#define NUMR 1000
#define NUMC 20000
#define DSIZE (NUMR*NUMC)
#define EXTC 10000
#define EXSZ (NUMR*EXTC)

#define nTPB 256
#define nBLK 64
typedef float real_t;

__global__ void extractColumn_kernel(real_t *tgt, real_t *src, int *indices, int numRows, int len) {
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  for (int j = tid; j < len; j += stride) {
    int colId = j / numRows;
    int rowId = j % numRows;
    tgt[j] = src[indices[colId]*numRows + rowId];
  }
}

__global__ void my_extractColumn_kernel(real_t *tgt, real_t *src, int *indices, int numRows, int len) {
  int stride = gridDim.x * blockDim.x;
  int div = stride/numRows;
  int rem = stride%numRows;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  int colId = tid / numRows;
  int rowId = tid % numRows;
  for (int j = tid; j < len; j += stride) {
    tgt[j] = src[indices[colId]*numRows + rowId];
    colId += div;
    rowId += rem;
    if (rowId >= numRows) {rowId-=numRows; colId++;}
  }
}


__global__ void copy_kernel(real_t *tgt, real_t *src, int len){
  int tid = threadIdx.x+blockDim.x*blockIdx.x;
  while (tid < len){
    tgt[tid] = src[tid];
    tid+=blockDim.x*gridDim.x;
  }
}


int main(){

  real_t *h_a, *d_a, *h_b, *d_b, *h_bi;
  h_a = (real_t *) malloc(DSIZE*sizeof(real_t));
  cudaMalloc(&d_a, DSIZE*sizeof(real_t));
  h_b = (real_t *) malloc(EXSZ*sizeof(real_t));
  cudaMalloc(&d_b, EXSZ*sizeof(real_t));
  h_bi = (real_t *) malloc(EXSZ*sizeof(real_t));
  int *h_ind, *d_ind;
  h_ind = (int *) malloc(EXTC*sizeof(int));
  cudaMalloc(&d_ind, EXTC*sizeof(int));

  for (int i = 0; i < EXTC; i++) h_ind[i] = i;
  for (int i = 0; i < DSIZE; i++) h_a[i] = i;

  cudaMemcpy(d_a, h_a, DSIZE*sizeof(real_t), cudaMemcpyHostToDevice);
  cudaMemcpy(d_ind, h_ind, EXTC*sizeof(int), cudaMemcpyHostToDevice);
  extractColumn_kernel<<<nBLK, nTPB>>>(d_b, d_a, d_ind, NUMR, NUMR*EXTC);
  cudaMemcpy(h_b, d_b, EXSZ*sizeof(real_t), cudaMemcpyDeviceToHost);
  copy_kernel<<<nBLK, nTPB>>>(d_b, d_a, NUMR*EXTC);
  cudaDeviceSynchronize();
  my_extractColumn_kernel<<<nBLK, nTPB>>>(d_b, d_a, d_ind, NUMR, NUMR*EXTC);
  cudaMemcpy(h_bi, d_b, EXSZ*sizeof(real_t), cudaMemcpyDeviceToHost);
  for (int i = 0; i < EXSZ; i++)
    if (h_bi[i] != h_b[i]) {printf("mismatch at %d, was: %f, should be: %f\n", i, h_bi[i], h_b[i]); return 1;}
  printf("Success!\n");
  return 0;
}

I've included tests of your kernel, my kernel, and in between a "copy kernel" that just does a pure copy of the same amount of data. This helps us confirm our idea of the upper bound on available bandwidth (see below).

Now the performance data. bandwidthTest on this GPU tells us:

$ /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     5855.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     6334.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     101535.4

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$

So we have about 100GB/s of bandwidth available. Now running nvprof --print-gpu-trace we see:

$ nvprof --print-gpu-trace ./t822
==17985== NVPROF is profiling process 17985, command: ./t822
Success!
==17985== Profiling application: ./t822
==17985== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
781.98ms  29.400ms                    -               -         -         -         -  80.000MB  2.7211GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
811.40ms  9.0560us                    -               -         -         -         -  40.000KB  4.4170GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
811.44ms  1.3377ms             (64 1 1)       (256 1 1)        15        0B        0B         -           -  Quadro 5000 (0)         1         7  extractColumn_kernel(float*, float*, int*, int, int) [188]
812.78ms  21.614ms                    -               -         -         -         -  40.000MB  1.8507GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]
834.94ms  816.10us             (64 1 1)       (256 1 1)         9        0B        0B         -           -  Quadro 5000 (0)         1         7  copy_kernel(float*, float*, int) [194]
835.77ms  911.39us             (64 1 1)       (256 1 1)        18        0B        0B         -           -  Quadro 5000 (0)         1         7  my_extractColumn_kernel(float*, float*, int*, int, int) [202]
836.69ms  20.661ms                    -               -         -         -         -  40.000MB  1.9360GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

The transfer size here is 1000 rows * 10000 columns * 4 bytes/element * 2 transfers per element (one read, one write) = 80,000,000 bytes. Your original kernel transfers this data in 1.34ms, for an average bandwidth of about 60GB/s. The "pure copy" kernel transfers the same data in 0.816ms, for an average bandwidth of 98GB/s, quite close to our 100GB/s goal. My modified "column copy" kernel takes 0.911ms, so it delivers about 88GB/s.
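For reference, here is a small sketch (my addition, not part of the original test code) of the arithmetic used to turn those profiled times into effective-bandwidth figures:

#include <stdio.h>

// Effective bandwidth in GB/s, given bytes moved and elapsed seconds.
static double effective_bw_GBps(double bytes_moved, double seconds) {
  return bytes_moved / seconds / 1e9;
}

int main() {
  double bytes = 1000.0 * 10000 * 4 * 2;  // rows * cols * sizeof(float) * (read + write) = 80,000,000
  printf("original kernel: %.1f GB/s\n", effective_bw_GBps(bytes, 1.3377e-3));  // ~59.8
  printf("copy kernel:     %.1f GB/s\n", effective_bw_GBps(bytes, 816.10e-6));  // ~98.0
  printf("modified kernel: %.1f GB/s\n", effective_bw_GBps(bytes, 911.39e-6));  // ~87.8
  return 0;
}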

Looking for more performance? Since my computed quantities rem and div are the same for every thread, you could precompute them in host code and pass them as kernel parameters. I'm not sure how much difference it would make, but you can give it a try. You now have a roadmap for evaluating the performance effect, if any.
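A minimal sketch of that suggestion (assuming the kernel signature is extended with two extra int parameters, div and rem, and is otherwise unchanged; the EDIT below makes this change together with a constant-memory change):

// Host side: div and rem are identical for every thread, so compute them once.
int mydiv = (nBLK * nTPB) / NUMR;   // whole columns advanced per grid-stride step
int myrem = (nBLK * nTPB) % NUMR;   // leftover rows advanced per grid-stride step
my_extractColumn_kernel<<<nBLK, nTPB>>>(d_b, d_a, d_ind, NUMR, NUMR*EXTC, mydiv, myrem);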

Notes:

  1. Note that I believe the index-updating logic in my modified kernel is sound, but I have not exhaustively tested it. It passes the simple test case I have presented here.

  2. I have omitted proper cuda error checking. However, if you run into trouble, you should use it, and/or run your code with cuda-memcheck. (See the error-checking sketch after these notes.)

  3. As I mentioned, your kernel is already written quite well from a coalescing standpoint, so it already achieves about 60% of the "best case". So if you're looking for a 2x, 5x, or 10x speedup here, you won't find it, and it's not reasonable to expect one.
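Regarding note 2, a minimal error-checking sketch (the macro name is my own illustration, not part of the original code) could look like this in a .cu file:

#include <cstdio>
#include <cstdlib>

// Abort with file/line information on any CUDA runtime error.
#define cudaCheck(call) do {                                        \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(1);                                                      \
    }                                                               \
  } while (0)

// Usage:
//   cudaCheck(cudaMemcpy(d_a, h_a, DSIZE*sizeof(real_t), cudaMemcpyHostToDevice));
//   my_extractColumn_kernel<<<nBLK, nTPB>>>(...);
//   cudaCheck(cudaGetLastError());
//   cudaCheck(cudaDeviceSynchronize());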

EDIT: further improvement

One possible reason why we're not getting closer to the pure-copy kernel in this case is this indirection:

    tgt[j] = src[indices[colId]*numRows + rowId];
                 ^^^^^^^^^^^^^^

The read from indices (a global array) represents "extra" data accesses that our pure-copy kernel does not have to perform. There may be some clever ways to optimize the handling of this access. Since it is repetitive (unlike the reads of src and the writes to tgt), it suggests that some use of a cache or of dedicated memory may help.

If we inspect the nature of the access carefully, we can observe that (for reasonably large matrices) most of the time the access to indices will be uniform across the threads of a warp. That means that normally, within a given warp, all threads will have the same value of colId and will therefore be requesting the same element from indices. This type of pattern suggests a possible optimization using CUDA __constant__ memory. The changes needed are not extensive; we basically need to move the indices data into a __constant__ array.

Here is the modified code:

$ cat t822.cu
#include <stdio.h>

#define NUMR 1000
#define NUMC 20000
#define DSIZE (NUMR*NUMC)
#define EXTC 10000
#define EXSZ (NUMR*EXTC)

#define nTPB 256
#define nBLK 64



typedef float real_t;

__constant__ int my_indices[EXTC];

__global__ void extractColumn_kernel(real_t *tgt, real_t *src, int *indices, int numRows, int len) {
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  for (int j = tid; j < len; j += stride) {
    int colId = j / numRows;
    int rowId = j % numRows;
    tgt[j] = src[indices[colId]*numRows + rowId];
  }
}

__global__ void my_extractColumn_kernel(real_t *tgt, real_t *src, int numRows, int len, int div, int rem) {
  int stride = gridDim.x * blockDim.x;
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  int colId = tid / numRows;
  int rowId = tid % numRows;
  for (int j = tid; j < len; j += stride) {
    tgt[j] = src[my_indices[colId]*numRows + rowId];
    colId += div;
    rowId += rem;
    if (rowId >= numRows) {rowId-=numRows; colId++;}
  }
}


__global__ void copy_kernel(real_t *tgt, real_t *src, int len){
  int tid = threadIdx.x+blockDim.x*blockIdx.x;
  while (tid < len){
    tgt[tid] = src[tid];
    tid+=blockDim.x*gridDim.x;
  }
}


int main(){

  real_t *h_a, *d_a, *h_b, *d_b, *h_bi;
  h_a = (real_t *) malloc(DSIZE*sizeof(real_t));
  cudaMalloc(&d_a, DSIZE*sizeof(real_t));
  h_b = (real_t *) malloc(EXSZ*sizeof(real_t));
  cudaMalloc(&d_b, EXSZ*sizeof(real_t));
  h_bi = (real_t *) malloc(EXSZ*sizeof(real_t));
  int *h_ind, *d_ind;
  h_ind = (int *) malloc(EXTC*sizeof(int));
  cudaMalloc(&d_ind, EXTC*sizeof(int));

  for (int i = 0; i < EXTC; i++) h_ind[i] = i;
  for (int i = 0; i < DSIZE; i++) h_a[i] = i;

  cudaMemcpy(d_a, h_a, DSIZE*sizeof(real_t), cudaMemcpyHostToDevice);
  cudaMemcpy(d_ind, h_ind, EXTC*sizeof(int), cudaMemcpyHostToDevice);
  extractColumn_kernel<<<nBLK, nTPB>>>(d_b, d_a, d_ind, NUMR, NUMR*EXTC);
  cudaMemcpy(h_b, d_b, EXSZ*sizeof(real_t), cudaMemcpyDeviceToHost);
  copy_kernel<<<nBLK, nTPB>>>(d_b, d_a, NUMR*EXTC);
  cudaDeviceSynchronize();
  cudaMemcpyToSymbol(my_indices, h_ind, EXTC*sizeof(int));
  int mydiv = (nBLK*nTPB)/NUMR;
  int myrem = (nBLK*nTPB)%NUMR;
  my_extractColumn_kernel<<<nBLK, nTPB>>>(d_b, d_a, NUMR, NUMR*EXTC, mydiv, myrem);
  cudaMemcpy(h_bi, d_b, EXSZ*sizeof(real_t), cudaMemcpyDeviceToHost);
  for (int i = 0; i < EXSZ; i++)
    if (h_bi[i] != h_b[i]) {printf("mismatch at %d, was: %f, should be: %f\n", i, h_bi[i], h_b[i]); return 1;}
  printf("Success!\n");
  return 0;
}


$

And the performance results:

$ nvprof --print-gpu-trace ./t822
==18998== NVPROF is profiling process 18998, command: ./t822
Success!
==18998== Profiling application: ./t822
==18998== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
773.01ms  28.300ms                    -               -         -         -         -  80.000MB  2.8269GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
801.33ms  9.0240us                    -               -         -         -         -  40.000KB  4.4326GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
801.38ms  1.3001ms             (64 1 1)       (256 1 1)        15        0B        0B         -           -  Quadro 5000 (0)         1         7  extractColumn_kernel(float*, float*, int*, int, int) [188]
802.68ms  20.773ms                    -               -         -         -         -  40.000MB  1.9256GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]
823.98ms  811.75us             (64 1 1)       (256 1 1)         9        0B        0B         -           -  Quadro 5000 (0)         1         7  copy_kernel(float*, float*, int) [194]
824.82ms  8.9920us                    -               -         -         -         -  40.000KB  4.4484GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
824.83ms  824.65us             (64 1 1)       (256 1 1)        13        0B        0B         -           -  Quadro 5000 (0)         1         7  my_extractColumn_kernel(float*, float*, int, int, int, int) [204]
825.66ms  21.023ms                    -               -         -         -         -  40.000MB  1.9027GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

We see that we are now very nearly achieving the goal of 100% utilization of the available bandwidth (824 us, vs. the copy-kernel time of 811 us). __constant__ memory is limited to 64KB total, so this means it is only usable this way if the indices (in effect, the number of columns that need to be copied) number less than about 16,000.
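As a quick sanity check of that limit (my addition, assuming nothing else is placed in __constant__ memory): at 4 bytes per int index, the 64KB budget caps the array at 65536/4 = 16384 entries, so a compile-time guard along these lines could be added to the modified code:

// Fails to compile if the index array no longer fits in the 64 KB __constant__ budget.
static_assert(EXTC * sizeof(int) <= 65536,
              "indices array exceeds 64 KB of __constant__ memory");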