use*_*218 44 c++ performance gcc sse avx
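I have been using Intel's SSE intrinsics for quite some time with good performance gains, so I expected the AVX intrinsics to speed up my programs further. Unfortunately, that has not been the case so far. I am probably making a stupid mistake, so I would be very grateful if somebody could help me out.
I am using Ubuntu 11.10 with g++ 4.6.1. I compiled my program (see below) with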
g++ simpleExample.cpp -O3 -march=native -o simpleExample
The test system has an Intel i7-2600 CPU.
Here is the code that exemplifies my problem. On my system, I get the output
98.715 ms, b[42] = 0.900038 // Naive
24.457 ms, b[42] = 0.900038 // SSE
24.646 ms, b[42] = 0.900038 // AVX
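Note that the calculation sqrt(sqrt(sqrt(x))) was chosen only to make sure that memory bandwidth does not limit execution speed; it is just an example.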
simpleExample.cpp:
#include <immintrin.h>
#include <iostream>
#include <math.h>
#include <stdlib.h> // srand48, drand48, used when initializing the input vector
#include <sys/time.h>
using namespace std;
// -----------------------------------------------------------------------------
// This function returns the current time, expressed as seconds since the Epoch
// -----------------------------------------------------------------------------
double getCurrentTime(){
struct timeval curr;
struct timezone tz;
gettimeofday(&curr, &tz);
double tmp = static_cast<double>(curr.tv_sec) * static_cast<double>(1000000)
+ static_cast<double>(curr.tv_usec);
return tmp*1e-6;
}
// -----------------------------------------------------------------------------
// Main routine
// -----------------------------------------------------------------------------
int main() {
srand48(0); // seed PRNG
double e,s; // timestamp variables
float *a, *b; // data pointers
float *pA,*pB; // work pointer
__m128 rA,rB; // variables for SSE
__m256 rA_AVX, rB_AVX; // variables for AVX
// define vector size
const int vector_size = 10000000;
// allocate memory
a = (float*) _mm_malloc (vector_size*sizeof(float),32);
b = (float*) _mm_malloc (vector_size*sizeof(float),32);
// initialize vectors //
for(int i=0;i<vector_size;i++) {
a[i]=fabs(drand48());
b[i]=0.0f;
}
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// Naive implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
s = getCurrentTime();
for (int i=0; i<vector_size; i++){
b[i] = sqrtf(sqrtf(sqrtf(a[i])));
}
e = getCurrentTime();
cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
// -----------------------------------------------------------------------------
for(int i=0;i<vector_size;i++) {
b[i]=0.0f;
}
// -----------------------------------------------------------------------------
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// SSE implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
pA = a; pB = b;
s = getCurrentTime();
for (int i=0; i<vector_size; i+=4){
rA = _mm_load_ps(pA);
rB = _mm_sqrt_ps(_mm_sqrt_ps(_mm_sqrt_ps(rA)));
_mm_store_ps(pB,rB);
pA += 4;
pB += 4;
}
e = getCurrentTime();
cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
// -----------------------------------------------------------------------------
for(int i=0;i<vector_size;i++) {
b[i]=0.0f;
}
// -----------------------------------------------------------------------------
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
// AVX implementation
// +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
pA = a; pB = b;
s = getCurrentTime();
for (int i=0; i<vector_size; i+=8){
rA_AVX = _mm256_load_ps(pA);
rB_AVX = _mm256_sqrt_ps(_mm256_sqrt_ps(_mm256_sqrt_ps(rA_AVX)));
_mm256_store_ps(pB,rB_AVX);
pA += 8;
pB += 8;
}
e = getCurrentTime();
cout << (e-s)*1000 << " ms" << ", b[42] = " << b[42] << endl;
_mm_free(a);
_mm_free(b);
return 0;
}
Any help is appreciated!
Evg*_*uev 10
If you are interested in increasing square-root performance, you can use VRSQRTPS and the Newton-Raphson formula instead of VSQRTPS:
x0 = vrsqrtps(a)
x1 = 0.5 * x0 * (3 - (a * x0) * x0)
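VRSQRTPS itself does not benefit from AVX, but the other calculations do. Note that x1 approximates 1/sqrt(a), so sqrt(a) itself is obtained as a * x1.
Use it if 23 bits of precision are enough for you.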
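The refinement step above can be sketched in C++ as follows. This is only an illustration, not code from the answer: the helper names `nr_rsqrt_step`, `sqrt_from_rsqrt`, `rough_rsqrt` and `sqrt_nr_avx` are hypothetical, and `rough_rsqrt` merely imitates a hardware estimate by perturbing the exact value with a relative error of 1.5 * 2^-12, the bound Intel documents for RSQRTPS/VRSQRTPS. The AVX variant is compiled only when the compiler defines `__AVX__` (for example with `-mavx` or a suitable `-march`).

```cpp
#include <cassert>
#include <cmath>
#ifdef __AVX__
#include <immintrin.h>
#endif

// One Newton-Raphson refinement step for 1/sqrt(a):
//   x1 = 0.5 * x0 * (3 - (a * x0) * x0)
// x0 is a rough estimate of 1/sqrt(a); the step roughly doubles the
// number of correct bits.
inline float nr_rsqrt_step(float a, float x0) {
    assert(a > 0.0f);  // this sketch only handles strictly positive inputs
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}

// sqrt(a) computed as a * (1/sqrt(a)) from the refined estimate.
inline float sqrt_from_rsqrt(float a, float x0) {
    return a * nr_rsqrt_step(a, x0);
}

// Hypothetical stand-in for the hardware estimate (assumption): the exact
// value with an injected relative error of 1.5 * 2^-12.
inline float rough_rsqrt(float a) {
    const float rel_err = 1.5f / 4096.0f;
    return (1.0f / std::sqrt(a)) * (1.0f + rel_err);
}

#ifdef __AVX__
// The same step on eight floats at once, using the real VRSQRTPS estimate.
// Caveat: for a == 0 this yields NaN (0 * inf), unlike _mm256_sqrt_ps.
inline __m256 sqrt_nr_avx(__m256 a) {
    const __m256 half  = _mm256_set1_ps(0.5f);
    const __m256 three = _mm256_set1_ps(3.0f);
    __m256 x0 = _mm256_rsqrt_ps(a);                       // rough 1/sqrt(a)
    __m256 t  = _mm256_mul_ps(_mm256_mul_ps(a, x0), x0);  // (a * x0) * x0
    __m256 x1 = _mm256_mul_ps(_mm256_mul_ps(half, x0),
                              _mm256_sub_ps(three, t));   // refined 1/sqrt(a)
    return _mm256_mul_ps(a, x1);                          // a * (1/sqrt(a))
}
#endif
```

In the question's AVX loop, `sqrt_nr_avx` could replace each `_mm256_sqrt_ps` call. Whether that is actually faster on a given CPU would have to be measured, and the result is not bit-identical to VSQRTPS.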
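Just for completeness: a Newton-Raphson (NR) implementation of operations such as division or square root is only beneficial if your code contains a limited number of those operations. The reason is that these alternative methods put more pressure on other ports, such as the multiplication and addition ports. That is basically why x86 architectures have dedicated hardware units for these operations instead of relying on software alternatives like NR. I quote from the Intel 64 and IA-32 Architectures Optimization Reference Manual, p. 556:
"In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution."
So be careful when using NR in large algorithms. I actually wrote my master's thesis around this point, and I will leave a link to it here for future reference once it is published.
Also, for people who always wonder about the throughput and latency of certain instructions, have a look at IACA. It is a very useful tool provided by Intel for statically analyzing the in-core execution performance of code.
Edit: here is the link to the thesis, for those who are interested.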
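Depending on your processor hardware, AVX instructions may be emulated in hardware as SSE instructions. You would need to look up your processor's part number to get exact specifications, but this is one of the primary differences between low-end and high-end Intel processors: the number of specialized execution units versus hardware emulation.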