Mew*_*ewa 2 c++ optimization performance
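I was looking at the following code and noticed something odd while timing it.
For the record, I'm doing this in Visual Studio 2010 on Windows 7 x64, with -O2 optimization in Release mode. My processor is an Intel i5.
There is a part of the code that writes to memory. I used to do it like this: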
d_res_matrix[x][y] = a;
In this case the whole program takes about 2.3 seconds to execute. While trying to make the code faster, I changed it to this:
d_res_matrix[x][y] = a + 0.00000001;
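It executes in 0.4 seconds! That's a huge difference, but I'm not sure why it happens.
To me it would make sense if it were slower, since the extra addition takes time. My alternative hypothesis is that the addition somehow forces the compiler to use SIMD for this operation (fetch, add and write?). Maybe the write would otherwise stall the pipeline, and this prevents that? Any ideas?
Edit (April 6, 6:19): The problem is the same on my home computer (Visual Studio 2012).
Edit (April 6, 6:38): The problem also occurs in Visual Studio 2008 (-O2, Release). In Debug both versions are slow, but equally slow.
Edit (April 8, 1:28): I installed Intel Parallel Studio XE (I'm a student), and it showed me a lot of useful things. First, I never actually deleted the arrays I allocated (I'm not fixing that here, but be aware of it). Freeing the memory didn't actually fix anything, though. As Richard states in his answer, the whole problem is caused by denormal floating-point values (see here for more information). The FP unit can't handle denormal values on its fast path and instead kicks off a microcode sequence, which is very slow.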
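The denormal explanation can be checked directly. The following is a minimal sketch, not part of the original program: `is_subnormal` and `with_offset` are hypothetical helper names, and the sketch assumes default floating-point settings (no flush-to-zero mode, no fast-math flags). It classifies a value with `std::fpclassify` and shows why adding a small constant such as `1e-9` moves a denormal result back into the normal range.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: true if v is a denormal (subnormal) double,
// i.e. nonzero but smaller in magnitude than the smallest normal double.
bool is_subnormal(double v) {
    return std::fpclassify(v) == FP_SUBNORMAL;
}

// Hypothetical helper mirroring the "a + 0.000000001" trick from the question:
// the sum is dominated by the offset, so the stored value is a normal number.
double with_offset(double a) {
    return a + 1e-9;
}

// Smallest positive denormal double, used here as a sample input.
const double kTinyDenormal = std::numeric_limits<double>::denorm_min();
```

Calling `is_subnormal(kTinyDenormal)` should report a denormal, while `is_subnormal(with_offset(kTinyDenormal))` should not, which matches the observation that the version with the addition never feeds denormals back into the next iteration.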
#include <time.h>
#include <stdio.h>
#include <cstdlib>
#include <stdlib.h>

#define DIM 1000
#define ITERATIONS 100
#define CPU_START clock_t t1; t1=clock();
#define CPU_END {long int final=clock()-t1; printf("CPU took %li ticks (%f seconds) \n", final, ((float)final)/CLOCKS_PER_SEC);}

int main(void)
{
    double ** d_matrix, ** d_res_matrix;
    d_res_matrix = new double * [DIM];
    d_matrix = new double * [DIM];
    for (int i = 0; i < DIM; i++)
    {
        d_matrix[i] = new double [DIM];
        d_res_matrix[i] = new double[DIM];
    }
    d_matrix[20][45] = 1; // start somewhere
    double f0, f1, f2, f3, f4;
    CPU_START;
    for (int iter = 0; iter < ITERATIONS; iter++)
    {
        for (int x = 1; x < DIM-1; x++) // avoid boundary cases for this example
        {
            for (int y = 1; y < DIM-1; y++)
            {
                f0 = d_matrix[x][y];
                f1 = d_matrix[x-1][y];
                f2 = d_matrix[x+1][y];
                f3 = d_matrix[x][y-1];
                f4 = d_matrix[x][y+1];
                double a = f0*0.6 + f1*0.1 + f2*0.1 + f3*0.1 + f4*0.1;
                // THIS PART IS INTERESTING:
                //d_res_matrix[x][y] = a;
                d_res_matrix[x][y] = a + 0.000000001;
            }
        }
        for (int x = 1; x < DIM-1; x++)
        {
            for (int y = 1; y < DIM-1; y++)
            {
                d_matrix[x][y] = d_res_matrix[x][y];
            }
        }
    }
    CPU_END;
    return 0;
}
Here are some screenshots of the output showing that this isn't a one-off. No more screenshots :D, here's some text instead!
Without the addition:
CPU took 3585 ticks <3.585000 seconds>
CPU took 3592 ticks <3.592000 seconds>
CPU took 3430 ticks <3.430000 seconds>
CPU took 2032 ticks <2.032000 seconds>
CPU took 3117 ticks <3.117000 seconds>
CPU took 2050 ticks <2.050000 seconds>
CPU took 3266 ticks <3.266000 seconds>
CPU took 3394 ticks <3.394000 seconds>
CPU took 3446 ticks <3.446000 seconds>
CPU took 3131 ticks <3.131000 seconds>
With the addition:
CPU took 430 ticks <0.430000 seconds>
CPU took 428 ticks <0.428000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 460 ticks <0.460000 seconds>
CPU took 471 ticks <0.471000 seconds>
CPU took 471 ticks <0.471000 seconds>
CPU took 460 ticks <0.460000 seconds>
小智 5
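You are probably generating denormal numbers in the first run, which the addition avoids. Subsequent operations on those denormals can be very expensive. In Debug mode your data is initialized to 0, but that doesn't happen in Release, so the values you're working with could be anything. I'd be surprised if you still saw this behavior after explicitly memset-ing d_matrix to 0 right after allocation.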
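The suggestion to zero the matrix after allocation could look roughly like the sketch below. It swaps the raw `new double[DIM]` rows from the question for `std::vector`, which is an assumption made for brevity rather than the original author's approach; with raw arrays, `new double[DIM]()` or `memset` would serve the same purpose. The names `Matrix`, `make_zeroed_matrix` and `count_subnormals` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical helper: builds a dim x dim matrix with every element set to 0.0,
// so no uninitialized heap contents can be misread as denormal doubles.
Matrix make_zeroed_matrix(std::size_t dim) {
    return Matrix(dim, std::vector<double>(dim, 0.0));
}

// Hypothetical helper: counts denormal (subnormal) elements, which is useful
// for checking whether a matrix holds the slow-path values described above.
std::size_t count_subnormals(const Matrix& m) {
    std::size_t n = 0;
    for (const auto& row : m) {
        for (double v : row) {
            if (std::fpclassify(v) == FP_SUBNORMAL) {
                ++n;
            }
        }
    }
    return n;
}
```

Allocating both `d_matrix` and `d_res_matrix` this way and calling `count_subnormals` before and after the timed loop is one way to confirm whether denormals are actually present in a given run.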