我想通过BS_x读取(BS_X + 1)*(BS_Y + 1)全局存储器位置*BS_Y线程将内容移动到共享存储器,我开发了以下代码.
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1)))
Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];
Run Code Online (Sandbox Code Playgroud)
根据我的理解,合并是顺序处理的连续内存读取的并行等价物.现在如何检测全局内存访问是否已合并?我注意到从(i1,j1)到(i2,j2)的索引跳转.提前致谢.
我用手写的合并分析器评估了代码的内存访问.评估显示代码较少利用合并.以下是您可能会发现有用的合并分析器:
#include <stdio.h>
#include <malloc.h>
typedef struct dim3_t{
int x;
int y;
} dim3;
// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16
// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size) (addr*size)&(~(CACHEBLOCKSIZE-1))
int main(){
// fixed dim3 variables
// grid and block size
dim3 blockDim,gridDim;
blockDim.x=BLOCKDIMX;
blockDim.y=BLOCKDIMY;
gridDim.x=GRIDDIMX;
gridDim.y=GRIDDIMY;
// counters
int unq_accesses=0;
int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
int total_unq_accesses=0;
// iter over total number of threads
// and count the number of memory requests (the coalesced requests)
int I, II, III;
for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
dim3 blockIdx;
blockIdx.x = I%GRIDDIMX;
blockIdx.y = I/GRIDDIMX;
for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
if(II%COALESCINGWIDTH==0){
// new coalescing bunch
total_unq_accesses+=unq_accesses;
unq_accesses=0;
}
dim3 threadIdx;
threadIdx.x=II%BLOCKDIMX;
threadIdx.y=II/BLOCKDIMX;
////////////////////////////////////////////////////////
// Change this section to evaluate different accesses //
////////////////////////////////////////////////////////
// do your indexing here
#define BLOCK_SIZE_X BLOCKDIMX
#define BLOCK_SIZE_Y BLOCKDIMY
#define xdim 32
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
// calculate the accessed location and offset here
// change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
int size = sizeof(double);
//////////////////////////
// End of modifications //
//////////////////////////
printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
// check whether it can be merged with existing requests or not
short merged=0;
for(III=0; III<unq_accesses; III++){
if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
merged=1;
break;
}
}
if(!merged){
// new cache block accessed over this coalescing width
unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
unq_accesses++;
}
}
}
printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}
Run Code Online (Sandbox Code Playgroud)
代码将针对网格的每个线程运行,并计算合并请求的数量,内存访问合并的度量.
要使用分析器,请将代码的索引计算部分粘贴到指定区域,并将内存访问(数组)分解为"地址"和"大小".我已经为你的代码做了这个,索引是:
int i = threadIdx.x;
int j = threadIdx.y;
int idx = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
int index1 = j*BLOCK_SIZE_Y+i;
int i1 = (index1)%(BLOCK_SIZE_X+1);
int j1 = (index1)/(BLOCK_SIZE_Y+1);
int i2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2 = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
Run Code Online (Sandbox Code Playgroud)
和内存访问是:
Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
Run Code Online (Sandbox Code Playgroud)
分析器报告4096个线程访问4064个缓存块.运行实际网格和块大小的代码并分析合并行为.