I am using the Point Cloud Library, which is written mainly in C++. When I compile it, I get the following errors:
[ 0%] Building CXX object common/CMakeFiles/pcl_common.dir/src/intersections.cpp.o
In file included from /home/lv/pcl-trunk/common/include/pcl/point_types.h:301:0,
from /home/lv/pcl-trunk/common/include/pcl/common/impl/common.hpp:41,
from /home/lv/pcl-trunk/common/include/pcl/common/common.h:186,
from /home/lv/pcl-trunk/common/include/pcl/common/intersections.h:41,
from /home/lv/pcl-trunk/common/src/intersections.cpp:38:
/home/lv/pcl-trunk/common/include/pcl/impl/point_types.hpp:1009:68: warning: ‘SHOT’ is deprecated [-Wdeprecated-declarations]
/tmp/ccRLy4Re.s: Assembler messages:
/tmp/ccRLy4Re.s:2488: Error: no such instruction: `vfmadd312ss (%r9),%xmm2,%xmm1'
/tmp/ccRLy4Re.s:2638: Error: no such instruction: `vfmadd312ss (%rdx),%xmm2,%xmm1'
/tmp/ccRLy4Re.s:3039: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm5,%xmm1'
/tmp/ccRLy4Re.s:3402: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm5,%xmm1'
/tmp/ccRLy4Re.s:3534: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm1,%xmm2'
/tmp/ccRLy4Re.s:3628: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm1,%xmm2'
/tmp/ccRLy4Re.s:6103: Error: no such instruction: `vfmadd312ss (%r11),%xmm0,%xmm4' …
In the CUDA samples, I have noticed a common idiom for computing the grid size. Here is an example:
int
main(){
    ...
    int numElements = 50000;
    int threadsPerBlock = 1024;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    ...
}
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}
What puzzles me is the initialization of blocksPerGrid. I do not understand why it is written as
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
rather than simply
    int blocksPerGrid = numElements / threadsPerBlock;
This seems to be a very common idiom. I have seen it in various projects, and they all do it this way. I am new to CUDA. Any explanation or background is welcome.
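One way to see the difference between the two expressions is to evaluate both on the host with the values from the example. The sketch below is plain C++, not CUDA, and the helper names `blocksTruncated` and `blocksRoundedUp` are hypothetical, introduced only for illustration. It assumes both arguments are positive integers.

```cpp
#include <cassert>

// Plain integer division: any remainder is discarded, so a partial
// final block is never launched.
int blocksTruncated(int numElements, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return numElements / threadsPerBlock;
}

// Ceiling division: adding (threadsPerBlock - 1) before dividing rounds
// the quotient up whenever there is a remainder.
int blocksRoundedUp(int numElements, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of elements that no thread would cover for a given block count.
int uncoveredElements(int numElements, int threadsPerBlock, int blocks) {
    int covered = blocks * threadsPerBlock;
    return covered >= numElements ? 0 : numElements - covered;
}
```

With numElements = 50000 and threadsPerBlock = 1024, the truncating form gives 48 blocks (49152 threads), leaving the last 848 elements unprocessed. The rounded-up form gives 49 blocks (50176 threads), which covers every element. The 176 surplus threads in the last block are the reason the kernel checks `if (i < numElements)` before touching memory.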
I did not know that CUDA does not support reference parameters. My program has the following two functions:
__global__ void
ExtractDisparityKernel (ExtractDisparity& es)
{
    es ();
}
__device__ __forceinline__ void
computeAdjacentValue (int x1, int y1, int x2, int y2, float& value)
{ ....
}
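For the global function, the compiler reports an error: /home/lv/pcl-trunk/gpu/kinfu_large_scale/src/cuda/estimate_combined.cu(959): error: global routines cannot have reference parameters
I searched for solutions, and some people say this is simply not allowed. But the device function does not trigger any such error. I am confused about whether CUDA supports reference arguments, or whether the compiler is somehow being fooled.
Can anyone give a complete answer to this question: when are references allowed, and when are they not?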
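I want to compute the average of a whole image in CUDA. To test how reduction behaves on a 2D array, I wrote the kernel below. The final output o should be the sum of all image values. The input g is a 2D array in which every pixel has the value 1. But the program reports a sum of 0, which seems strange to me.
I modeled this on the 1D array reduction in this tutorial: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf and wrote this 2D version. I am new to CUDA. Suggestions about potential bugs and improvements are welcome!
One more comment: I know it makes more sense to compute the average over a 1D array. But I want to go further and test more complex reduction behavior. This may not be the right approach; it is only a test. I hope someone can give me more advice on common practices for reduction.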
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
cudaEvent_t start, stop;
float elapsedTime;
__global__ void
reduce(float *g, float *o, const int dimx, const int dimy)
{
    extern __shared__ float sdata[];
    unsigned int tid_x = threadIdx.x;
    unsigned int tid_y = threadIdx.y;
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;
    if (i >= dimx || j >= dimy)
        return;
    sdata[tid_x*blockDim.y + tid_y] = g[i*dimy + j];
    __syncthreads(); …
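When debugging a reduction like this, it can help to have a host-side reference value to compare the kernel output against. The sketch below is plain C++ rather than CUDA. `referenceSum` and `referenceMean` are hypothetical helper names, not part of the original program. It assumes the image is stored contiguously and indexed as `g[i*dimy + j]`, the same layout the kernel reads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum every pixel of a dimx-by-dimy image stored as g[i*dimy + j].
// A double accumulator limits rounding error on large images.
double referenceSum(const std::vector<float>& g, int dimx, int dimy) {
    assert(dimx > 0 && dimy > 0);
    assert(g.size() == static_cast<std::size_t>(dimx) * static_cast<std::size_t>(dimy));
    double sum = 0.0;
    for (int i = 0; i < dimx; ++i) {
        for (int j = 0; j < dimy; ++j) {
            sum += g[static_cast<std::size_t>(i) * dimy + j];
        }
    }
    return sum;
}

// Mean pixel value: the sum divided by the number of pixels.
double referenceMean(const std::vector<float>& g, int dimx, int dimy) {
    return referenceSum(g, dimx, dimy) / (static_cast<double>(dimx) * dimy);
}
```

For the test input described above, where every pixel is 1, the reference sum equals dimx * dimy and the mean is 1, so a device result of 0 would point to a problem in the kernel or its launch rather than in the input data.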