Posted by use*_*209

Concurrency in CUDA multi-GPU execution

I am running a CUDA kernel function on a multi-GPU system with 4 GPUs. I expected the kernels to launch concurrently, but they don't. I measured the start time of each kernel, and the second kernel starts only after the first one has finished executing. As a result, launching the kernel on 4 GPUs is no faster than on a single GPU.

How can I make them run concurrently?

Here is my code:

cudaSetDevice(0);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_0, parameterA + (0*rateA), parameterB + (0*rateB));
cudaMemcpyAsync(h_result_0, d_result_0, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(1);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_1, parameterA + (1*rateA), parameterB + (1*rateB));
cudaMemcpyAsync(h_result_1, d_result_1, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(2);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_2, parameterA + (2*rateA), parameterB + (2*rateB));
cudaMemcpyAsync(h_result_2, d_result_2, mem_size_result, cudaMemcpyDeviceToHost);

cudaSetDevice(3);
GPU_kernel<<< gridDim, threadsPerBlock >>>(d_result_3, parameterA + (3*rateA), parameterB + (3*rateB));
cudaMemcpyAsync(h_result_3, d_result_3, mem_size_result, cudaMemcpyDeviceToHost);
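For reference, here is a sketch of what I understand a concurrent version should look like. Kernel launches are asynchronous with respect to the host, but `cudaMemcpyAsync` into pageable host memory behaves like a blocking copy, which would serialize the loop above. The sketch below assumes the numbered buffers are gathered into arrays (`d_result[i]`, `h_result[i]` are my renaming, not in the original code) and that the host buffers are allocated with `cudaMallocHost` (pinned memory):

    // Assumption: NUM_GPUS, d_result[], h_result[] are illustrative names.
    // h_result[i] must be pinned (cudaMallocHost) for the async copy to
    // actually overlap; error checking omitted for brevity.
    const int NUM_GPUS = 4;
    cudaStream_t streams[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }

    // Issue all work back-to-back; nothing here blocks the host.
    for (int i = 0; i < NUM_GPUS; ++i) {
        cudaSetDevice(i);
        GPU_kernel<<< gridDim, threadsPerBlock, 0, streams[i] >>>(
            d_result[i], parameterA + (i*rateA), parameterB + (i*rateB));
        cudaMemcpyAsync(h_result[i], d_result[i], mem_size_result,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for every GPU to finish before using the results.
    for (int i = 0; i < NUM_GPUS; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }

Is this the right approach, or is something else serializing the launches?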

concurrency cuda gpu multiple-gpu

4 votes · 1 answer · 3278 views
