我有一个内核,它计算总和.如果我通过内核计算声明的变量数量,我会假设每个内核总共有5个寄存器*.但是,在分析内核时,使用了34个寄存器.我需要降低到30个寄存器以允许执行1024个线程.
任何人都可以看到有什么问题吗?
__global__ void sum_kernel(float* values, float bk_size, int start_idx, int end_idx, int resolution, float* avgs){
// Allocate shared memory (assuming a maximum of 1024 threads).
__shared__ float sums[1024];
// Boundary check.
if(blockIdx.x == 0){
avgs[blockIdx.x] = values[start_idx];
return;
}
else if(blockIdx.x == resolution-1) {
avgs[blockIdx.x] = values[start_idx+(end_idx-start_idx)-1];
return;
}
else if(blockIdx.x > resolution -2){
return;
}
// Iteration index calculation.
unsigned int idx_prev = floor((blockIdx.x + 0) * bk_size) + 1;
unsigned int from = idx_prev + threadIdx.x*(bk_size / blockDim.x);
unsigned int to = from + (bk_size / blockDim.x);
to = (to < (end_idx-start_idx))? to : (end_idx-start_idx);
// Partial average calculation using shared memory.
sums[threadIdx.x] = 0;
for (from; from < to; from++)
{
sums[threadIdx.x] += values[from+start_idx];
}
__syncthreads();
// Addition of partial sums.
if(threadIdx.x != 0) return;
from = 1;
for(from; from < 1024; from++)
{
sum += sums[from];
}
avgs[blockIdx.x] = sum;
}
Run Code Online (Sandbox Code Playgroud)
您无法根据声明的变量数估计已使用寄存器的数量.编译器可以使用寄存器进行地址计算或存储未明确声明的临时变量等.
例如,我已经拆解了内核函数的第一部分,即
__global__ void sum_kernel(float* values, float bk_size, int start_idx, int end_idx, int resolution, float* avgs){
// Boundary check.
if(blockIdx.x == 0){
avgs[blockIdx.x] = values[start_idx];
return;
}
else if(blockIdx.x == resolution-1) {
avgs[blockIdx.x] = values[start_idx+(end_idx-start_idx)-1];
return;
}
else if(blockIdx.x > resolution -2){
return;
}
}
Run Code Online (Sandbox Code Playgroud)
具有以下结果
code for sm_20
Function : _Z10sum_kernelPffiiiS_
.headerflags @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */ R1 = [0x1][0x100]
/*0008*/ S2R R2, SR_CTAID.X; /* 0x2c00000094009c04 */ R2 = BlockIdx.x
/*0010*/ MOV R0, c[0x0][0x34]; /* 0x28004000d0001de4 */ R0 = [0x0][0x34]
/*0018*/ ISETP.EQ.AND P0, PT, R2, RZ, PT; /* 0x190e0000fc21dc23 */ if (R2 == 0)
/*0020*/ @P0 BRA 0x78; /* 0x40000001400001e7 */
/*0028*/ MOV R0, c[0x0][0x30]; /* 0x28004000c0001de4 */
/*0030*/ IADD R0, R0, -0x1; /* 0x4800fffffc001c03 */
/*0038*/ ISETP.NE.AND P0, PT, R2, R0, PT; /* 0x1a8e00000021dc23 */
/*0040*/ @P0 EXIT ; /* 0x80000000000001e7 */
/*0048*/ MOV R0, c[0x0][0x2c]; /* 0x28004000b0001de4 */
/*0050*/ ISCADD R2, R2, c[0x0][0x34], 0x2; /* 0x40004000d0209c43 */
/*0058*/ ISCADD R0, R0, c[0x0][0x20], 0x2; /* 0x4000400080001c43 */
/*0060*/ LDU R0, [R0+-0x4]; /* 0x8bfffffff0001c85 */
/*0068*/ ST [R2], R0; /* 0x9000000000201c85 */
/*0070*/ BRA 0x98; /* 0x4000000080001de7 */
/*0078*/ MOV R2, c[0x0][0x28]; /* 0x28004000a0009de4 */
/*0080*/ ISCADD R2, R2, c[0x0][0x20], 0x2; /* 0x4000400080209c43 */
/*0088*/ LDU R2, [R2]; /* 0x8800000000209c85 */ R2 used for addressing and storing gmem data
/*0090*/ ST [R0], R2; /* 0x9000000000009c85 */ R0 used for addressing
/*0098*/ EXIT ; /* 0x8000000000001de7 */
Run Code Online (Sandbox Code Playgroud)
在上面的CUDA代码片段中,没有显式声明的变量.正如你可以从反汇编代码看到,编译器使用3
的寄存器,即R0
,R1
和R2
.这些寄存器在功能上是可交互的,用于存储常量,存储器地址和全局存储器值.