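Edit: This question is a reworked version of the original, so the first few replies may no longer be relevant.
I am curious what effect forcing a device function call to be non-inlined has on synchronization inside that device function. I have a simple test kernel that illustrates the behavior in question.
The kernel takes a buffer and passes it to a device function, along with a shared buffer and an indicator variable that identifies a single thread as the "boss" thread. The device function has divergent code: the boss thread first spends time performing trivial operations on the shared buffer, then writes to the global buffer. After a synchronization call, all threads write to the global buffer. After the kernel call, the host prints the contents of the global buffer. Here is the code:
CUDA code: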
test_main.cu
#include <stdio.h>   // printf
#include <stdlib.h>  // calloc
#include <cutil_inline.h>
#include "test_kernel.cu"

int main()
{
    int scratchBufferLength = 100;
    int *scratchBuffer;    // host copy of the scratch buffer
    int *d_scratchBuffer;  // device copy of the scratch buffer
    int b = 1;             // number of blocks
    int t = 64;            // threads per block

    // copy scratch buffer to device
    scratchBuffer = (int *)calloc(scratchBufferLength, sizeof(int));
    cutilSafeCall( cudaMalloc(&d_scratchBuffer,
                              sizeof(int) * scratchBufferLength) );
    cutilSafeCall( cudaMemcpy(d_scratchBuffer, scratchBuffer,
                              sizeof(int) * scratchBufferLength, cudaMemcpyHostToDevice) );

    // kernel call
    testKernel<<<b, t>>>(d_scratchBuffer);
    // wait for the kernel to finish (deprecated since CUDA 4.0 in favor of cudaDeviceSynchronize())
    cudaThreadSynchronize();

    // copy data back to host
    cutilSafeCall( cudaMemcpy(scratchBuffer, d_scratchBuffer,
                              sizeof(int) * scratchBufferLength, cudaMemcpyDeviceToHost) );

    // print results
    printf("Scratch buffer contents: \t");
    for(int i=0; i < …
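For readers who want something compact to experiment with, here is a minimal, self-contained sketch of the pattern described in the question: a boss thread does some work, a barrier follows, then every thread writes to the global buffer. It is not the original `test_kernel.cu`. The names `bossThenAll`, `sketchKernel`, `runBossSketch`, `SKETCH_THREADS` and `SKETCH_WORK` are hypothetical and chosen only for illustration. The sketch uses the plain runtime API (`cudaDeviceSynchronize` instead of the `cutil` helpers) and assumes a device that supports generic addressing, so a shared-memory pointer can be passed to a non-inlined function.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SKETCH_THREADS 64   // hypothetical block size, matching t = 64 in the question
#define SKETCH_WORK    1000 // hypothetical amount of boss-thread busy work

// Hypothetical device function. __noinline__ asks the compiler not to inline it.
// Every thread of the block calls it, so every thread reaches the barrier.
__device__ __noinline__ void bossThenAll(int *globalBuf, int *sharedBuf, bool isBoss)
{
    if (isBoss) {
        // Boss thread only: trivial work on the shared buffer,
        // then one write to the global buffer.
        sharedBuf[0] = 0;
        for (int i = 0; i < SKETCH_WORK; ++i)
            sharedBuf[0] += 1;
        globalBuf[0] = sharedBuf[0];
    }
    __syncthreads();  // barrier: the boss thread's writes are visible after this point
    // All threads: write what they observe in the shared buffer.
    globalBuf[1 + threadIdx.x] = sharedBuf[0];
}

__global__ void sketchKernel(int *globalBuf)
{
    __shared__ int sharedBuf[1];
    bossThenAll(globalBuf, sharedBuf, threadIdx.x == 0);
}

// Runs the kernel with one block and copies SKETCH_THREADS + 1 ints into hostOut.
cudaError_t runBossSketch(int *hostOut)
{
    const size_t bytes = sizeof(int) * (SKETCH_THREADS + 1);
    int *devOut = NULL;
    cudaError_t err = cudaMalloc(&devOut, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemset(devOut, 0, bytes);
    if (err == cudaSuccess) {
        sketchKernel<<<1, SKETCH_THREADS>>>(devOut);
        err = cudaDeviceSynchronize();  // also reports launch and execution errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    return err;
}

int main()
{
    int hostOut[SKETCH_THREADS + 1];
    cudaError_t err = runBossSketch(hostOut);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Slot 0 is the boss thread's write; slots 1..SKETCH_THREADS are per-thread writes.
    for (int i = 0; i <= SKETCH_THREADS; ++i)
        printf("%d ", hostOut[i]);
    printf("\n");
    return 0;
}
```

If the barrier inside the non-inlined function behaves as a block-wide barrier, every slot should hold the boss thread's final value (`SKETCH_WORK`). A slot holding anything else would mean a thread passed the barrier before the boss thread finished. Removing `__noinline__` gives the inlined variant to compare against.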