除了__syncthreads()
同步线程块内的扭曲的函数之外,还有另一个函数称为__syncwarp()
. 这个函数究竟有什么作用?
在CUDA编程指南说,
will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.
Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.
So does this mean that this function ensures synchronization in threads within a warp that is included by the mask? If so, do we need such synchronization within the threads in the same warp since they all are ensured to be executed in lockstep?