For data-independent problems on the GPU, is launching one thread per element always optimal?

I am writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and compare it with cudaMemcpy(). It looks like this:

template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    //This copy kernel is only correct when size%sizeof(vector_type)==0
    auto numElements = size / sizeof(vector_type);

    for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
        dest2[id] = src2[id];
    }
}
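The loop in the kernel above is a grid-stride loop: each thread starts at its global index and advances by the total number of threads in the grid, so the copy stays correct for any block count, including one thread per element. As a rough host-side model (plain C++, no GPU needed; the names `gridStrideVisits` and `coversExactlyOnce` are made up for this illustration and are not part of the original post), the following counts how often each element index would be visited:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the kernel's grid-stride loop (hypothetical helper).
// Returns, for each element index, how many times a grid of
// numBlocks x threadsPerBlock threads would visit it.
std::vector<int> gridStrideVisits(std::size_t numElements,
                                  unsigned numBlocks,
                                  unsigned threadsPerBlock) {
    // A real kernel launch also needs at least one block and one thread.
    assert(numBlocks > 0 && threadsPerBlock > 0);

    std::vector<int> visits(numElements, 0);
    const std::size_t stride =
        static_cast<std::size_t>(numBlocks) * threadsPerBlock;

    for (unsigned block = 0; block < numBlocks; ++block) {
        for (unsigned thread = 0; thread < threadsPerBlock; ++thread) {
            // Same start index and stride as in the kernel.
            std::size_t id =
                static_cast<std::size_t>(threadsPerBlock) * block + thread;
            for (; id < numElements; id += stride) {
                ++visits[id];
            }
        }
    }
    return visits;
}

// True when every element would be copied exactly once (hypothetical helper).
bool coversExactlyOnce(std::size_t numElements,
                       unsigned numBlocks,
                       unsigned threadsPerBlock) {
    for (int v : gridStrideVisits(numElements, numBlocks, threadsPerBlock)) {
        if (v != 1) return false;
    }
    return true;
}
```

For example, `coversExactlyOnce(1000, 4, 256)` holds, and so does `coversExactlyOnce(1000, 1, 32)`, where each modeled thread handles many elements. This only checks the indexing; it says nothing about which grid size is fastest.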

I also calculated the number of blocks needed to reach 100% occupancy, as follows:

THREADS_PER_BLOCK = 256 
Multi-Processors: 4 
Max Threads per Multi Processor: 2048 
NUM_BLOCKS = 4 * …
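The calculation above is cut off in the source, so the exact expression is not known. A common thread-count-only estimate, which may or may not be what was intended here, multiplies the number of multiprocessors by the number of blocks that fit into each one's thread limit. A minimal sketch in plain C++ (the function name `blocksForFullOccupancy` is an assumption for illustration):

```cpp
#include <cassert>

// Thread-count-only estimate of the grid size that fills every
// multiprocessor (hypothetical helper, not taken from the original post).
// It ignores register, shared-memory, and resident-block limits, all of
// which can lower the real figure.
unsigned blocksForFullOccupancy(unsigned multiProcessors,
                                unsigned maxThreadsPerMultiProcessor,
                                unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    // Integer division: a block that does not fit completely is not counted.
    const unsigned blocksPerMultiProcessor =
        maxThreadsPerMultiProcessor / threadsPerBlock;
    return multiProcessors * blocksPerMultiProcessor;
}
```

On a real device, the CUDA runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the per-multiprocessor block count for a specific kernel and takes those resource limits into account, so it is usually a safer basis than the hand calculation.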

cuda gpu gpgpu
