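Since CUDA 9, the shfl instructions are deprecated and should be replaced by their shfl_sync counterparts.
But how should I replace them when the two behave differently?
Code example: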
__global__
static void shflTest(){
    int tid = threadIdx.x;
    float value = tid + 0.1f;
    int* ivalue = reinterpret_cast<int*>(&value);
    //use the integer shfl
    int ix = __shfl(ivalue[0],5,32);
    int iy = __shfl_sync(ivalue[0],5,32);
    float x = reinterpret_cast<float*>(&ix)[0];
    float y = reinterpret_cast<float*>(&iy)[0];
    if(tid == 0){
        printf("shfl tmp %d %d\n",ix,iy);
        printf("shfl final %f %f\n",x,y);
    }
}
int main()
{
    shflTest<<<1,32>>>();
    cudaDeviceSynchronize();
    return 0;
}
Output:
shfl tmp 1084437299 5
shfl final 5.100000 0.000000
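A likely explanation for the mismatch: __shfl_sync takes the participation mask as its first argument, i.e. __shfl_sync(mask, var, srcLane, width). In the call above, ivalue[0] is therefore read as the mask, 5 as the value to exchange, and 32 as the source lane, which is consistent with the 5 in the output. The sketch below shows what a corrected call might look like, assuming all 32 lanes of the warp take part (mask 0xffffffff). The kernel name shflSyncTest and the host helper shuffleFromLane5 are made up for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every lane reads the float held by lane 5 of its warp.
__global__ void shflSyncTest(float* out) {
    int tid = threadIdx.x;
    float value = tid + 0.1f;

    // Argument order: mask, variable, source lane, width.
    // 0xffffffff names all 32 lanes as participants (an assumption
    // that holds here because the block is exactly one full warp).
    // The float overload is used, so no cast through int is needed.
    float y = __shfl_sync(0xffffffffu, value, 5, 32);

    if (tid == 0) {
        *out = y;
    }
}

// Hypothetical host helper: launch one warp and return lane 0's result.
float shuffleFromLane5() {
    float* d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    shflSyncTest<<<1, 32>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    printf("shfl_sync final %f\n", shuffleFromLane5());
    return 0;
}
```

With the mask in place, the result should match what the old __shfl call produced (the 5.100000 shown in the output above). If only some lanes of a warp reach the shuffle, the mask would need to name exactly those lanes instead of 0xffffffff.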
Besides the __syncthreads() function, which synchronizes the warps within a thread block, there is another function called __syncwarp(). What exactly does this function do?
The CUDA Programming Guide says that __syncwarp()
will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.
Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, …
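To make the quoted passage concrete, here is a minimal sketch of threads in one warp communicating through shared memory: each lane stores a value, calls __syncwarp(), and then reads the value stored by its neighbour. The kernel name syncwarpExchange, the helper runSyncwarpExchange, and the stored values (lane * 10) are assumptions made up for this illustration, and it assumes a launch with a single 32-thread block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each lane publishes a value in shared memory, then reads its neighbour's.
__global__ void syncwarpExchange(int* out) {
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = lane * 10;   // store: each lane writes its own slot

    // Barrier over the lanes named in the mask (default 0xffffffff,
    // i.e. all 32). It also orders the stores above before the loads
    // below for the participating lanes.
    __syncwarp();

    out[lane] = buf[(lane + 1) % 32];   // load: read the next lane's slot
}

// Hypothetical host helper: runs one warp and copies 32 results to h_out.
void runSyncwarpExchange(int* h_out) {
    int* d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(int));
    syncwarpExchange<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main() {
    int result[32];
    runSyncwarpExchange(result);
    printf("lane 0 read %d, lane 31 read %d\n", result[0], result[31]);
    return 0;
}
```

Without the __syncwarp() call, nothing guarantees that a lane's store is visible before its neighbour's load, since lanes of a warp are not guaranteed to execute in lockstep on recent architectures. __syncthreads() would also work here but synchronizes the whole block, which is more than this exchange needs.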