我正在编写一个简单的memcpy内核,以测量GTX 760M的内存带宽并将其与cudaMemcpy()进行比较。看起来像这样:
template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
using vector_type = int2;
vector_type* src2 = reinterpret_cast<vector_type*>(src);
vector_type* dest2 = reinterpret_cast<vector_type*>(dest);
//This copy kernel is only correct when size%sizeof(vector_type)==0
auto numElements = size / sizeof(vector_type);
for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
dest2[id] = src2[id];
}
}
Run Code Online (Sandbox Code Playgroud)
我还计算了达到100%占用率所需的块数,如下所示:
THREADS_PER_BLOCK = 256
Multi-Processors: 4
Max Threads per Multi Processor: 2048
NUM_BLOCKS = 4 * …Run Code Online (Sandbox Code Playgroud)