First of all, I'm fairly new to CUDA programming, so I apologize for such a simple question. I have researched the best way to determine dimGrid and dimBlock for my GPU kernel call, but for some reason I haven't quite gotten it to work.
On my home PC I have a GeForce GTX 580 (compute capability 2.0), 1024 threads per block, etc. I can get my code to run properly on this PC. My GPU fills a distance array of size 988*988. Here is part of the code:
#define SIZE 988

__global__ void createDistanceTable(double *d_distances, double *d_coordinates)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < SIZE && col < SIZE)
        d_distances[row * SIZE + col] =
            acos(__sinf(d_coordinates[row * 2 + 0]) *
                 __sinf(d_coordinates[col * 2 + 0]) +
                 __cosf(d_coordinates[row * 2 + 0]) *
                 __cosf(d_coordinates[col * 2 + 0]) *
                 __cosf(d_coordinates[col * 2 + 1] -
                        d_coordinates[row * 2 + 1])) * 6371;
}
The kernel call in main:
dim3 dimBlock(32,32,1);
dim3 dimGrid(32,32,1);
createDistanceTable<<<dimGrid, dimBlock>>>(d_distances, d_coordinates);
My problem is that I simply haven't found a way to get this code running properly on my laptop. My laptop's GPU is a GeForce 9600M GT (compute capability 1.1), 512 threads per block, etc. I would greatly appreciate any guidance that helps me understand how I should set dimBlock and dimGrid for the kernel call on my laptop. Thanks for any advice!
There are several errors in your code.

You use double, but devices of compute capability 1.1 do not support double precision; if you really need multi-architecture code, you can use __CUDA_ARCH__. You also don't check for errors on your API calls and kernel launches (and you should run your program under cuda-memcheck). You may be allocating more memory than you have, or using more threads/blocks than your GPU can handle, and you won't detect it. Consider the following example based on your code (I use float instead of double):
#include <cuda.h>
#include <stdio.h>  // printf
#include <stdlib.h> // exit

#define SIZE 988
#define GRID_SIZE 32
#define BLOCK_SIZE 16 // set to 16 instead of 32 for instance

#define CUDA_CHECK_ERROR() __cuda_check_errors(__FILE__, __LINE__)
#define CUDA_SAFE_CALL(err) __cuda_safe_call(err, __FILE__, __LINE__)

// See: http://codeyarns.com/2011/03/02/how-to-do-error-checking-in-cuda/
inline void
__cuda_check_errors (const char *filename, const int line_number)
{
    cudaError err = cudaDeviceSynchronize ();
    if (err != cudaSuccess)
    {
        printf ("CUDA error %i at %s:%i: %s\n",
                err, filename, line_number, cudaGetErrorString (err));
        exit (-1);
    }
}

inline void
__cuda_safe_call (cudaError err, const char *filename, const int line_number)
{
    if (err != cudaSuccess)
    {
        printf ("CUDA error %i at %s:%i: %s\n",
                err, filename, line_number, cudaGetErrorString (err));
        exit (-1);
    }
}

__global__ void
createDistanceTable (float *d_distances, float *d_coordinates)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < SIZE && col < SIZE)
        d_distances[row * SIZE + col] =
            acos (__sinf (d_coordinates[row * 2 + 0]) *
                  __sinf (d_coordinates[col * 2 + 0]) +
                  __cosf (d_coordinates[row * 2 + 0]) *
                  __cosf (d_coordinates[col * 2 + 0]) *
                  __cosf (d_coordinates[col * 2 + 1] -
                          d_coordinates[row * 2 + 1])) * 6371;
}

int
main ()
{
    float *d_distances;
    float *d_coordinates;

    CUDA_SAFE_CALL (cudaMalloc (&d_distances, SIZE * SIZE * sizeof (float)));
    CUDA_SAFE_CALL (cudaMalloc (&d_coordinates, SIZE * SIZE * sizeof (float)));

    dim3 dimGrid (GRID_SIZE, GRID_SIZE);
    dim3 dimBlock (BLOCK_SIZE, BLOCK_SIZE);

    createDistanceTable <<< dimGrid, dimBlock >>> (d_distances, d_coordinates);

    CUDA_CHECK_ERROR ();
    CUDA_SAFE_CALL (cudaFree (d_distances));
    CUDA_SAFE_CALL (cudaFree (d_coordinates));

    return 0;
}
The compilation command (change the architecture accordingly):
nvcc prog.cu -g -G -lineinfo -gencode arch=compute_11,code=sm_11 -o prog
For 32x32 blocks on CC 2.0, or 16x16 blocks on CC 1.1:
cuda-memcheck ./prog
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
For 33x33 blocks on CC 2.0, or 32x32 blocks on CC 1.1:
cuda-memcheck ./prog
========= CUDA-MEMCHECK
========= Program hit error 9 on CUDA API call to cudaLaunch
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/nvidia-current-updates/libcuda.so [0x26a230]
=========     Host Frame:/opt/cuda/lib64/libcudart.so.5.0 (cudaLaunch + 0x242) [0x2f592]
=========     Host Frame:./prog [0xc76]
=========     Host Frame:./prog [0xa99]
=========     Host Frame:./prog [0xac4]
=========     Host Frame:./prog [0x9d1]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
=========     Host Frame:./prog [0x859]
=========
========= ERROR SUMMARY: 1 error
Error 9 is defined as:
/**
 * This indicates that a kernel launch is requesting resources that can
 * never be satisfied by the current device. Requesting more shared memory
 * per block than the device supports will trigger this error, as will
 * requesting too many threads or blocks. See ::cudaDeviceProp for more
 * device limitations.
 */
cudaErrorInvalidConfiguration = 9,
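As for the multi-architecture point: __CUDA_ARCH__ is defined only while device code is being compiled, and expands to the target compute capability times 100 (e.g. 110 for CC 1.1, 200 for CC 2.0), so a kernel can select a code path per architecture at compile time. A hypothetical sketch (the kernel name and body are illustrative only):

```cuda
// Compiled once per -gencode target; the preprocessor picks the
// branch matching each architecture.
__global__ void computeDistances(float *d_out, const float *d_in)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 130
    // CC 1.3+ supports double precision: use the accurate path.
    d_out[threadIdx.x] = (float) sin((double) d_in[threadIdx.x]);
#else
    // Older devices (e.g. CC 1.1): fall back to single precision.
    d_out[threadIdx.x] = __sinf(d_in[threadIdx.x]);
#endif
}
```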