Mic*_*ner 0 reduction opencl jocl
If I use a barrier in my kernel (whether CLK_LOCAL_MEM_FENCE or CLK_GLOBAL_MEM_FENCE), enqueueing it fails with a CL_INVALID_WORK_GROUP_SIZE error. The global work size is 512 and the local work size is 128; 65536 items have to be computed. The max work group size of my device is 1024, and I am using only one dimension. For Java bindings I use JOCL.
The kernel is very simple:
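kernel void sum(global float *input, global float *output, const int numElements, local float *localCopy)
{
    // Stage this work-item's element in local memory.
    localCopy[get_local_id(0)] = input[get_global_id(0)];
    // Wait until every work-item in the group has written its element.
    barrier(CLK_LOCAL_MEM_FENCE); // or barrier(CLK_GLOBAL_MEM_FENCE)
}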
I run the kernel on the Intel(R) Xeon(R) CPU X5570 @ 2.93GHz and can use OpenCL 1.2. The calling method looks like
kernel.putArg(aCLBuffer).putArg(bCLBuffer).putArg(elementCount).putNullArg(localWorkSize);
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
But the error is always the same:
[...]can not enqueue 1DRange CLKernel [...] with gwo: null gws: {512} lws: {128}
cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]
What am I doing wrong?
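This is expected behaviour on some OpenCL platforms. For example, on my Apple system the CPU device has a maximum work-group size of 1024. However, if a kernel contains a barrier, the maximum work-group size for that specific kernel is reduced to 1.
You can query the maximum work-group size for a specific kernel by calling clGetKernelWorkGroupInfo with the CL_KERNEL_WORK_GROUP_SIZE parameter. The value returned will be no greater than the one returned by clGetDeviceInfo with CL_DEVICE_MAX_WORK_GROUP_SIZE, but it is allowed to be smaller (as it is in this case).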
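Once the kernel-specific limit is known, the host code still has to choose a local work size that respects it. The following is a minimal sketch of one way to do that, not part of JOCL or OpenCL: the class `WorkGroupSizing` and the method `pickLocalSize` are hypothetical names, and `kernelMax` is assumed to be the value you obtained from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE. It also assumes the OpenCL 1.x rule that the global work size must be evenly divisible by the local work size.

```java
// Hypothetical helper (not part of JOCL): chooses a 1D local work size
// that is legal for a given kernel.
final class WorkGroupSizing {

    /**
     * Returns the largest local work size that is
     *  - no larger than the size the caller asked for,
     *  - no larger than the kernel-specific limit (assumed to come from
     *    clGetKernelWorkGroupInfo / CL_KERNEL_WORK_GROUP_SIZE), and
     *  - an exact divisor of the global work size.
     */
    static long pickLocalSize(long globalSize, long requested, long kernelMax) {
        if (globalSize <= 0 || requested <= 0 || kernelMax <= 0) {
            throw new IllegalArgumentException("all sizes must be positive");
        }
        // Start from the tightest upper bound.
        long candidate = Math.min(requested, Math.min(kernelMax, globalSize));
        // Walk down until the candidate divides the global size evenly.
        // The loop always terminates because 1 divides everything.
        while (globalSize % candidate != 0) {
            candidate--;
        }
        return candidate;
    }

    public static void main(String[] args) {
        // Values from the question, with a kernel limit of 1 as described
        // in the answer for a kernel that contains a barrier.
        System.out.println(pickLocalSize(512, 128, 1));
    }
}
```

With the numbers from the question, a kernel limit of 1024 leaves the requested size of 128 unchanged, while a kernel limit of 1 forces the local work size down to 1. The result would then be passed as the local work size when enqueueing the kernel.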