the*_*dev 12 c c++ optimization compiler-optimization
好吧,我一直在和一位朋友谈论编译器和程序的优化,他建议这样n * 0.5做比n / 2.我说编译器会自动执行这种优化,所以我编写了一个小程序来查看n / 2和之间是否存在差异n * 0.5:
师:
#include <stdio.h>
#include <time.h>
int main(int argc, const char * argv[]) {
int i, m;
float n, s;
clock_t t;
m = 1000000000;
t = clock();
for(i = 0; i < m; i++) {
n = i / 2;
}
s = (float)(clock() - t) / CLOCKS_PER_SEC;
printf("n = i / 2: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);
return 0;
}
Run Code Online (Sandbox Code Playgroud)
乘法:
#include <stdio.h>
#include <time.h>
int main(int argc, const char * argv[]) {
int i, m;
float n, s;
clock_t t;
m = 1000000000;
t = clock();
for(i = 0; i < m; i++) {
n = i * 0.5;
}
s = (float)(clock() - t) / CLOCKS_PER_SEC;
printf("n = i * 0.5: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);
return 0;
}
Run Code Online (Sandbox Code Playgroud)
对于这两个版本我平均得到0.000002秒.编译时clang main.c -O1.他说时间测量一定有问题.然后他写了一个程序:
#include <cstdio>
#include <iostream>
#include <ctime>
using namespace std;
int main()
{
clock_t ts, te;
double dT;
int i, m;
double n, o, p, q, r, s;
m = 1000000000;
cout << "Independent calculations:\n";
ts = clock();
for (i = 0; i < m; i++)
{
// make it a trivial pure float calculation with no int casting to float
n = 11.1 / 2.3;
o = 22.2 / 2.3;
p = 33.3 / 2.3;
q = 44.4 / 2.3;
r = 55.5 / 2.3;
s = 66.6 / 2.3;
}
te = clock();
dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // make initial call to get the elapsed time to run the loop
ts = clock();
printf("Division: %d calculations took %f seconds\n", m, dT);
for (i = 0; i < m; i++)
{
// make it a trivial pure float calculation with no int casting to float
n = 11.1 * 0.53;
o = 22.2 * 0.53;
p = 33.3 * 0.53;
q = 44.4 * 0.53;
r = 55.5 * 0.53;
s = 66.6 * 0.53;
}
te = clock();
dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // make initial call to get the elapsed time to run the loop
ts = clock();
printf("Multiplication: %d calculations took %f seconds\n", m, dT);
cout << "\nDependent calculations:\n";
for (i = 0; i < m; i++)
{
// make it a trivial pure float calculation with no int casting to float
n = 11.1 / 2.3;
o = n / 2.3;
p = o / 2.3;
q = p / 2.3;
r = q / 2.3;
s = r / 2.3;
}
te = clock();
dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // make initial call to get the elapsed time to run the loop
ts = clock();
printf("Division: %d calculations took %f seconds\n", m, dT);
for (i = 0; i < m; i++)
{
// make it a trivial pure float calculation with no int casting to float
n = 11.1 * 0.53;
o = n * 0.53;
p = o * 0.53;
q = p * 0.53;
r = q * 0.53;
s = r * 0.53;
}
te = clock();
dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // make initial call to get the elapsed time to run the loop
ts = clock();
printf("Multiplication: %d calculations took %f seconds\n", m, dT);
return 0;
}
Run Code Online (Sandbox Code Playgroud)
为此,他得到了......
1.869570s
1.868254s
25.674016s
3.497555s
Run Code Online (Sandbox Code Playgroud)
...以该顺序.
所以我在编译的机器上运行程序clang++ main.cpp -O1,我得到了类似的结果:0.000002 to 0.000011.
然而,当我在没有优化的情况下编译程序时,我在第一次测试时得到了类似的结果.所以我的问题是,如何能够优化任何量使程序是多快?
wol*_*k88 19
由于代码在循环的每次迭代中没有做任何不同的事情,编译器可以自由地在循环内移动代码(结果将完全相同),并完全删除循环,让你几乎为0运行时,如你所见.
for (i = 0; i < m; i++)
{
// make it a trivial pure float calculation with no int casting to float
n = 11.1 * 0.53;
o = n * 0.53;
p = o * 0.53;
q = p * 0.53;
r = q * 0.53;
s = r * 0.53;
}
Run Code Online (Sandbox Code Playgroud)
是一个不引用i或不m包含循环引用的循环,因此编译器删除循环语句是微不足道的
这是一个很好的例子,说明对高级语言进行基准测试比基准测试组件更难(这已经足够难以完成了).编译器可以并且经常会干扰您的基准测试.
你的朋友有一个观点,一个分裂(实际分裂,而不仅仅是写/C)比乘法慢.对于双精度而言,延迟的比率大约为4,并且除法不是流水线,而乘法是,因此吞吐率更差:大约20(这些数字适用于Haswell,但是是典型的)
整数除法比浮点除法慢,但在整数上使用浮点乘法会导致两次转换.转换不是太糟糕,所以总的来说,浮点乘法仍然更快.
但是任何适当的编译器都会将整数除以常数变为整数乘法和移位,也可能是一些额外的修复内容(取决于除数和类型).以2的幂划分甚至更简单.有关详细信息,请参阅使用乘法划分不变整数.举个例子,考虑一下
int div2(int i)
{
return i / 2;
}
Run Code Online (Sandbox Code Playgroud)
GCC把它变成了
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
Run Code Online (Sandbox Code Playgroud)
根据μarch,需要3或4个周期(不包括控制流程).
另一方面,
int div2(int i)
{
return i * 0.5;
}
Run Code Online (Sandbox Code Playgroud)
变成了
cvtsi2sd xmm0, edi
mulsd xmm0, QWORD PTR .LC0[rip]
cvttsd2si eax, xmm0
ret
.LC0:
.long 0
.long 1071644672
Run Code Online (Sandbox Code Playgroud)
这需要大约4 + 5 + 4 = 13个周期.
即使没有任何"不安全的优化",也允许编译器f / 2.0变为f * 0.5,除以2的幂相当于乘以其逆.这不适用于不是2的幂的数字.
因此,即使使用至少测量某些东西的基准,也可能不会测量正确的东西.为了测量延迟浮点除法与乘法,您可以编写如下内容:
.section data
align 16
one: dq 1.0, 1.0
.section text
_bench1:
mov ecx, -10000000
movapd xmm0, [one]
loopone:
mulpd xmm0, xmm0
mulpd xmm0, xmm0
add ecx, 1
jnz loopone
ret
_bench2:
mov ecx, -10000000
movapd xmm0, [one]
looptwo:
divpd xmm0, xmm0
divpd xmm0, xmm0
add ecx, 1
jnz looptwo
ret
Run Code Online (Sandbox Code Playgroud)
打电话给一千,包裹起来rdtsc得到时间.两者的最短时间.将时间乘以基准时钟和涡轮时钟之间的比率.那么你应该有(约)周期两个环带,划分的数量由2000万获得每循环的次数mulpd或divpd.时间分割取决于被分割的值,因此它不会给出最一般的结果.
您可能还对指令延迟和吞吐量列表感兴趣.