So far I have only written programs that call a kernel once. So I have a kernel
__global__ void someKernel(float *d_in /* any parameters */)
{
    // some operation
}
and I basically do this:
int main()
{
    // create an array in device memory
    cudaMalloc(......);
    // move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    // call the kernel
    someKernel<<<nblocks, 512>>>(.......);
    // copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // Point to notice HERE
}
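For reference, a self-contained version of this single-launch pattern might look like the sketch below; the array size, the scale-by-two operation, and all the variable names are placeholders I made up, not my real code:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: scales each element, standing in for "some operation".
__global__ void someKernel(float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_in[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;                       // placeholder problem size
    const int nthreads = 512;
    const int nblocks = (N + nthreads - 1) / nthreads;

    float *h_data = new float[N];
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    // create an array in device memory
    float *d_in = nullptr;
    cudaMalloc(&d_in, N * sizeof(float));

    // move host data to that array
    cudaMemcpy(d_in, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // call the kernel (asynchronous launch)
    someKernel<<<nblocks, nthreads>>>(d_in, N);

    // copy results to host memory; this cudaMemcpy waits for the prior kernel
    // on the default stream before copying
    cudaMemcpy(h_data, d_in, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_in);
    delete[] h_data;
    return 0;
}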
It works fine. But this time I want to call the kernel not just once but several times, something like:
int main()
{
    // create an array in device memory
    cudaMalloc(......);
    // move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    // call the kernel
    someKernel<<<nblocks, 512>>>(.......);
    // copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // From here
    // some unrelated calculations
    dothis();
    dothat();
    // then call the kernel again, repeatedly
    for (auto k : some_ks)
    {
        // do some pre-calculations
        // call the kernel
        someKernel<<<nblocks, 512>>>(.......);
        // some post-calculations
    }
}
My question is: should I use some kind of synchronization, such as cudaDeviceSynchronize, between the first kernel call and the kernel calls inside the for loop (and between each iteration)? Or is it unnecessary?
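To make the question concrete, here is a sketch of the loop body with the synchronization call I am asking about; preCalculations, postCalculations, d_in and some_ks are just placeholders for my real host-side work and data:

// Sketch only: where would the synchronization in question go?
for (auto k : some_ks)
{
    preCalculations(k);                  // hypothetical host-side pre-calculations

    someKernel<<<nblocks, 512>>>(d_in);  // asynchronous launch on the default stream

    cudaDeviceSynchronize();             // <-- is this needed here, or is it redundant,
                                         //     given that kernels issued to the same
                                         //     stream run in issue order?

    postCalculations(k);                 // hypothetical host-side post-calculations
}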