Gri*_*art 8 c double performance sse quaternions
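I read a bit about using SSE intrinsics and tried my luck at implementing quaternion rotation with doubles. Here are the normal and SSE functions I wrote,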
void quat_rot(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////
    REAL cross_temp[3],result[3];

    cross_temp[0]=a.el[2]*b[2]-a.el[3]*b[1]+a.el[0]*b[0];
    cross_temp[1]=a.el[3]*b[0]-a.el[1]*b[2]+a.el[0]*b[1];
    cross_temp[2]=a.el[1]*b[1]-a.el[2]*b[0]+a.el[0]*b[2];

    result[0]=b[0]+2.0*(a.el[2]*cross_temp[2]-a.el[3]*cross_temp[1]);
    result[1]=b[1]+2.0*(a.el[3]*cross_temp[0]-a.el[1]*cross_temp[2]);
    result[2]=b[2]+2.0*(a.el[1]*cross_temp[1]-a.el[2]*cross_temp[0]);

    b[0]=result[0];
    b[1]=result[1];
    b[2]=result[2];
}
With SSE
inline void cross_p(__m128d *a, __m128d *b, __m128d *c){
    const __m128d SIGN_NP = _mm_set_pd(0.0, -0.0);

    __m128d l1 = _mm_mul_pd( _mm_unpacklo_pd(a[1], a[1]), b[0] );
    __m128d l2 = _mm_mul_pd( _mm_unpacklo_pd(b[1], b[1]), a[0] );
    __m128d m1 = _mm_sub_pd(l1, l2);
    m1 = _mm_shuffle_pd(m1, m1, 1);
    m1 = _mm_xor_pd(m1, SIGN_NP);

    l1 = _mm_mul_pd( a[0], _mm_shuffle_pd(b[0], b[0], 1) );
    __m128d m2 = _mm_sub_sd(l1, _mm_unpackhi_pd(l1, l1));

    c[0] = m1;
    c[1] = m2;
}

void quat_rotSSE(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////
    __m128d axb[2];
    __m128d aa[2];
    aa[0] = _mm_load_pd(a.el+1);
    aa[1] = _mm_load_sd(a.el+3);
    __m128d bb[2];
    bb[0] = _mm_load_pd(b);
    bb[1] = _mm_load_sd(b+2);

    cross_p(aa, bb, axb);

    __m128d w = _mm_set1_pd(a.el[0]);
    axb[0] = _mm_add_pd(axb[0], _mm_mul_pd(w, bb[0]));
    axb[1] = _mm_add_sd(axb[1], _mm_mul_sd(w, bb[1]));

    cross_p(aa, axb, axb);

    _mm_store_pd(b, _mm_add_pd(bb[0], _mm_add_pd(axb[0], axb[0])));
    _mm_store_sd(b+2, _mm_add_pd(bb[1], _mm_add_sd(axb[1], axb[1])));
}
The rotation is basically done using the formula,
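Read off term by term from quat_rot above, the rotation appears to be the following. This is a reconstruction from the code rather than the original formula, and the symbols are assumptions: w = a.el[0] is the scalar part of the unit quaternion, r = (a.el[1], a.el[2], a.el[3]) its vector part, and v the vector b.

```latex
\[
\mathbf{v}' = \mathbf{v} + 2\,\mathbf{r} \times \left( \mathbf{r} \times \mathbf{v} + w\,\mathbf{v} \right)
\]
```

Here cross_temp holds the bracketed term and result holds v'.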

I then ran the following test to check how long each function takes to perform a set of rotations,
int main(int argc, char *argv[]){
    REAL a[] __attribute__ ((aligned(16))) = {0.2, 1.3, 2.6};
    quat_t q = {{0.1, 0.7, -0.3, -3.2}};

    REAL sum = 0.0;
    for(int i = 0; i < 4; i++) sum += q.el[i] * q.el[i];
    sum = sqrt(sum);
    for(int i = 0; i < 4; i++) q.el[i] /= sum;

    int N = 1000000000;
    for(int i = 0; i < N; i++){
        quat_rotSSE(q, a);
    }

    printf("rot = ");
    for(int i = 0; i < 3; i++) printf("%f, ", a[i]);
    printf("\n");
    return 0;
}
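I compiled with gcc 4.6.3 and -O3 -std=c99 -msse3.
Timed with the unix time command, the normal function takes 18.841s and the SSE one takes 21.689s.
What am I missing, and why is the SSE implementation 15% slower than the normal one? In which cases would an SSE implementation be faster for double precision?
EDIT: Taking advice from the comments, I tried several things,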
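Used restrict on the cross_p function and added an __m128d to hold the second cross product. This made no difference to the assembly produced. movapd. The assembly code generated for the SSE function is only 4 lines shorter than that of the normal function.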
EDIT: Added links to the generated assembly,
Bre*_*dan 10
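SSE (and SIMD in general) works really well when you perform the same operation on a large number of elements and there are no dependencies between the operations. For example, if you have an array of doubles and need to do array[i] = (array[i] * K + L)/M + N; for each element, then SSE/SIMD will help.
If you are not performing the same operation on a large number of elements, then SSE does not help. For example, if you have a single double and need to do foo = (foo * K + L)/M + N; then SSE/SIMD is not going to help.
Basically, SSE is the wrong tool for this job. You need to change the job into one where SSE is the right tool. For example, instead of multiplying one vector by one quaternion, try multiplying an array of 1000 vectors by a quaternion, or an array of 1000 vectors by an array of 1000 quaternions.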
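As a rough sketch of the array case mentioned above, the per-element expression can be applied to two doubles per iteration with SSE2 intrinsics. The function name scale_offset and its parameter list are made up for illustration, and unaligned loads are used so that no alignment is assumed:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics for packed doubles */
#include <stddef.h>

/* Hypothetical helper: array[i] = (array[i] * K + L) / M + N for n doubles. */
void scale_offset(double *array, size_t n, double K, double L, double M, double N)
{
    assert(array != NULL || n == 0);

    /* Broadcast each constant into both lanes of a register. */
    const __m128d k  = _mm_set1_pd(K);
    const __m128d l  = _mm_set1_pd(L);
    const __m128d m  = _mm_set1_pd(M);
    const __m128d nn = _mm_set1_pd(N);
    size_t i = 0;

    /* Main loop: two independent elements per iteration. */
    for (; i + 2 <= n; i += 2) {
        __m128d x = _mm_loadu_pd(array + i);
        x = _mm_add_pd(_mm_mul_pd(x, k), l);
        x = _mm_add_pd(_mm_div_pd(x, m), nn);
        _mm_storeu_pd(array + i, x);
    }

    /* Scalar tail for an odd element count. */
    for (; i < n; i++)
        array[i] = (array[i] * K + L) / M + N;
}
```

Every lane does the same arithmetic on unrelated data, which is the situation described above; the single-double foo case has nothing to put in the second lane.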
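EDIT: Added everything below here!
Note that this typically means modifying the data structures to suit. For example, rather than having an array of structures, it is often better to have a structure of arrays.
For a better example, assume your code works on an array of quaternions, like this: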
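To make the two layouts concrete, here is a small sketch. The type names quat_aos_t and quat_soa_t, the COUNT constant, and the conversion helper are all hypothetical and not taken from the question:

```c
#include <assert.h>
#include <stddef.h>

#define COUNT 4  /* hypothetical batch size */

/* Array of structures: the four components of ONE quaternion are adjacent,
 * so an array of these interleaves components in memory. */
typedef struct { double el[4]; } quat_aos_t;

/* Structure of arrays: the SAME component of adjacent quaternions is
 * adjacent, so el[c][i] and el[c][i+1] can be loaded as one SSE register. */
typedef struct { double el[4][COUNT]; } quat_soa_t;

/* Hypothetical helper: transpose COUNT quaternions from AoS to SoA. */
void quat_aos_to_soa(const quat_aos_t *in, quat_soa_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < COUNT; i++)
        for (size_t c = 0; c < 4; c++)
            out->el[c][i] = in[i].el[c];
}
```

If the data can be kept in the second layout all the time, no conversion pass is needed at all, which is usually the point of changing the data structure.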
for(i = 0; i < quaternionCount; i++) {
    cross_temp[i][0] = a[i][2] * b[i][2] - a[i][3] * b[i][1] + a[i][0] * b[i][0];
    cross_temp[i][1] = a[i][3] * b[i][0] - a[i][1] * b[i][2] + a[i][0] * b[i][1];
    cross_temp[i][2] = a[i][1] * b[i][1] - a[i][2] * b[i][0] + a[i][0] * b[i][2];

    b[i][0] = b[i][0] + 2.0 * (a[i][2] * cross_temp[i][2] - a[i][3] * cross_temp[i][1]);
    b[i][1] = b[i][1] + 2.0 * (a[i][3] * cross_temp[i][0] - a[i][1] * cross_temp[i][2]);
    b[i][2] = b[i][2] + 2.0 * (a[i][1] * cross_temp[i][1] - a[i][2] * cross_temp[i][0]);
}
The first step is to convert it into a "quaternion of arrays" (one array per component), and do this:
for(i = 0; i < quaternionCount; i++) {
    cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
    cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
    cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];

    b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
    b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
    b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
}
Then, because 2 adjacent doubles fit in a single SSE register, you want to unroll the loop by 2:
for(i = 0; i < quaternionCount; i += 2) {
    cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
    cross_temp[0][i+1] = a[2][i+1] * b[2][i+1] - a[3][i+1] * b[1][i+1] + a[0][i+1] * b[0][i+1];
    cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
    cross_temp[1][i+1] = a[3][i+1] * b[0][i+1] - a[1][i+1] * b[2][i+1] + a[0][i+1] * b[1][i+1];
    cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
    cross_temp[2][i+1] = a[1][i+1] * b[1][i+1] - a[2][i+1] * b[0][i+1] + a[0][i+1] * b[2][i+1];

    b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
    b[0][i+1] = b[0][i+1] + 2.0 * (a[2][i+1] * cross_temp[2][i+1] - a[3][i+1] * cross_temp[1][i+1]);
    b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
    b[1][i+1] = b[1][i+1] + 2.0 * (a[3][i+1] * cross_temp[0][i+1] - a[1][i+1] * cross_temp[2][i+1]);
    b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
    b[2][i+1] = b[2][i+1] + 2.0 * (a[1][i+1] * cross_temp[1][i+1] - a[2][i+1] * cross_temp[0][i+1]);
}
Now you want to break this up into individual operations. For example, the first 2 lines of the inner loop would become:
cross_temp[0][i] = a[2][i] * b[2][i];
cross_temp[0][i] -= a[3][i] * b[1][i];
cross_temp[0][i] += a[0][i] * b[0][i];
cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];
Now re-order that:
cross_temp[0][i] = a[2][i] * b[2][i];
cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
cross_temp[0][i] -= a[3][i] * b[1][i];
cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
cross_temp[0][i] += a[0][i] * b[0][i];
cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];
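Once you have done all that, think about converting to SSE. The first 2 lines of code are one load (which loads both a[2][i] and a[2][i+1] into an SSE register) followed by one multiplication (rather than 2 separate loads and 2 separate multiplications). Those 6 lines might become (pseudo-code):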
load SSE_register1 with both a[2][i] and a[2][i+1]
multiply SSE_register1 with both b[2][i] and b[2][i+1]
load SSE_register2 with both a[3][i] and a[3][i+1]
multiply SSE_register2 with both b[1][i] and b[1][i+1]
load SSE_register3 with both a[0][i] and a[0][i+1]
multiply SSE_register3 with both b[0][i] and b[0][i+1]
SSE_register1 = SSE_register1 - SSE_register2
SSE_register1 = SSE_register1 + SSE_register3
Each line of pseudo-code here is a single SSE instruction/intrinsic, and every SSE instruction/intrinsic does 2 operations in parallel.
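One possible way to write that pseudo-code with SSE2 intrinsics is sketched below. The helper name cross_temp0_pair and its flattened parameter list (one pointer per row of the "quaternion of arrays" layout, for example a2 standing for a[2]) are assumptions made for illustration. Unaligned loads are used so that nothing is assumed about alignment; the load of each b row is written as a separate intrinsic, and whether the compiler folds it into the multiply depends on the target and alignment.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics for packed doubles */
#include <stddef.h>

/* Hypothetical helper: computes cross_temp0[i] and cross_temp0[i+1], that is
 * a2*b2 - a3*b1 + a0*b0 for two adjacent elements at once. */
void cross_temp0_pair(const double *a0, const double *a2, const double *a3,
                      const double *b0, const double *b1, const double *b2,
                      double *cross_temp0, size_t i)
{
    assert(cross_temp0 != NULL);

    __m128d r1 = _mm_loadu_pd(a2 + i);           /* a[2][i], a[2][i+1]   */
    r1 = _mm_mul_pd(r1, _mm_loadu_pd(b2 + i));   /* ... * b[2][i], [i+1] */
    __m128d r2 = _mm_loadu_pd(a3 + i);           /* a[3][i], a[3][i+1]   */
    r2 = _mm_mul_pd(r2, _mm_loadu_pd(b1 + i));   /* ... * b[1][i], [i+1] */
    __m128d r3 = _mm_loadu_pd(a0 + i);           /* a[0][i], a[0][i+1]   */
    r3 = _mm_mul_pd(r3, _mm_loadu_pd(b0 + i));   /* ... * b[0][i], [i+1] */

    r1 = _mm_sub_pd(r1, r2);                     /* register1 -= register2 */
    r1 = _mm_add_pd(r1, r3);                     /* register1 += register3 */

    _mm_storeu_pd(cross_temp0 + i, r1);          /* write both results */
}
```

The final store is not in the pseudo-code; it is added here only so that the helper returns its result.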
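If every instruction does 2 operations in parallel, then (in theory) it could be up to twice as fast as the original "one operation per instruction" code.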
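Putting the steps together, a whole rotation over the "quaternion of arrays" layout might look like the sketch below. The function name quat_rot_soa_sse2 and its signature are hypothetical, count is assumed to be even, and no timing claim is made for it; it only shows the same arithmetic as the scalar loop, done on two elements per iteration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics for packed doubles */
#include <stddef.h>

/* Hypothetical kernel. a[c] points to component c of every quaternion
 * (a[0] is the scalar part); b[c] points to component c of every vector.
 * Rotates b in place, two elements per iteration. */
void quat_rot_soa_sse2(double *const a[4], double *const b[3], size_t count)
{
    assert(count % 2 == 0);  /* assumption: no scalar tail is handled here */

    for (size_t i = 0; i < count; i += 2) {
        const __m128d w  = _mm_loadu_pd(a[0] + i);
        const __m128d x  = _mm_loadu_pd(a[1] + i);
        const __m128d y  = _mm_loadu_pd(a[2] + i);
        const __m128d z  = _mm_loadu_pd(a[3] + i);
        const __m128d b0 = _mm_loadu_pd(b[0] + i);
        const __m128d b1 = _mm_loadu_pd(b[1] + i);
        const __m128d b2 = _mm_loadu_pd(b[2] + i);

        /* cross_temp: same three expressions as the scalar loop. */
        const __m128d t0 = _mm_add_pd(_mm_sub_pd(_mm_mul_pd(y, b2), _mm_mul_pd(z, b1)),
                                      _mm_mul_pd(w, b0));
        const __m128d t1 = _mm_add_pd(_mm_sub_pd(_mm_mul_pd(z, b0), _mm_mul_pd(x, b2)),
                                      _mm_mul_pd(w, b1));
        const __m128d t2 = _mm_add_pd(_mm_sub_pd(_mm_mul_pd(x, b1), _mm_mul_pd(y, b0)),
                                      _mm_mul_pd(w, b2));

        /* Second cross product with cross_temp. */
        const __m128d u0 = _mm_sub_pd(_mm_mul_pd(y, t2), _mm_mul_pd(z, t1));
        const __m128d u1 = _mm_sub_pd(_mm_mul_pd(z, t0), _mm_mul_pd(x, t2));
        const __m128d u2 = _mm_sub_pd(_mm_mul_pd(x, t1), _mm_mul_pd(y, t0));

        /* b += 2.0 * u, written as u + u. */
        _mm_storeu_pd(b[0] + i, _mm_add_pd(b0, _mm_add_pd(u0, u0)));
        _mm_storeu_pd(b[1] + i, _mm_add_pd(b1, _mm_add_pd(u1, u1)));
        _mm_storeu_pd(b[2] + i, _mm_add_pd(b2, _mm_add_pd(u2, u2)));
    }
}
```

No shuffles or sign flips are needed in this form, because each register only ever holds the same component of two different elements. Whether it actually approaches the theoretical factor of two would have to be measured.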