Nat*_*urz 22 c assembly loops intel memory-alignment
我希望我将问题简化为一个简单且可重复的测试用例.源(在此处)包含10个相同的简单循环副本.每个循环的形式如下:
#define COUNT (1000 * 1000 * 1000)
volatile uint64_t counter = 0;
void loopN(void) {
for (int j = COUNT; j != 0; j--) {
uint64_t val = counter;
val = val + 1;
counter = val;
}
return;
}
Run Code Online (Sandbox Code Playgroud)
变量的'volatile'很重要,因为它强制值在每次迭代时从内存中读取和写入.使用'-falign-loops = 64'将每个循环对齐到64个字节,并生成相同的程序集,除了对全局的相对偏移量:
400880: 48 8b 15 c1 07 20 00 mov 0x2007c1(%rip),%rdx # 601048 <counter>
400887: 48 83 c2 01 add $0x1,%rdx
40088b: 83 e8 01 sub $0x1,%eax
40088e: 48 89 15 b3 07 20 00 mov %rdx,0x2007b3(%rip) # 601048 <counter>
400895: 75 e9 jne 400880 <loop8+0x20>
Run Code Online (Sandbox Code Playgroud)
我在Intel Haswell i7-4470上运行Linux 3.11.我正在用GCC 4.8.1和命令行编译程序:
gcc -std=gnu99 -O3 -falign-loops=64 -Wall -Wextra same-function.c -o same-function
Run Code Online (Sandbox Code Playgroud)
我也在源代码中使用属性((noinline))来使程序集更清晰,但这不是观察问题所必需的.我发现shell循环中最快和最慢的函数:
for n in 0 1 2 3 4 5 6 7 8 9;
do echo same-function ${n}:;
/usr/bin/time -f "%e seconds" same-function ${n};
/usr/bin/time -f "%e seconds" same-function ${n};
/usr/bin/time -f "%e seconds" same-function ${n};
done
Run Code Online (Sandbox Code Playgroud)
它产生的结果在运行到运行时大约为1%,最快和最慢函数的确切数量根据确切的二进制布局而变化:
same-function 0:
2.08 seconds
2.04 seconds
2.06 seconds
same-function 1:
2.12 seconds
2.12 seconds
2.12 seconds
same-function 2:
2.10 seconds
2.14 seconds
2.11 seconds
same-function 3:
2.04 seconds
2.04 seconds
2.05 seconds
same-function 4:
2.05 seconds
2.00 seconds
2.03 seconds
same-function 5:
2.07 seconds
2.07 seconds
1.98 seconds
same-function 6:
1.83 seconds
1.83 seconds
1.83 seconds
same-function 7:
1.95 seconds
1.98 seconds
1.95 seconds
same-function 8:
1.86 seconds
1.88 seconds
1.86 seconds
same-function 9:
2.04 seconds
2.04 seconds
2.02 seconds
Run Code Online (Sandbox Code Playgroud)
在这种情况下,我们看到loop2()是最慢的执行之一,而loop6()是最快的之一,差异只有10%以上.我们通过使用不同的方法重复测试这两种情况来重新确认这一点:
nate@haswell$ N=2; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
7,180,104,866 cycles # 3.391 GHz
7,169,930,711 cycles # 3.391 GHz
7,150,190,394 cycles # 3.391 GHz
7,188,959,096 cycles # 3.391 GHz
7,177,272,608 cycles # 3.391 GHz
7,093,246,955 cycles # 3.391 GHz
7,210,636,865 cycles # 3.391 GHz
7,239,838,211 cycles # 3.391 GHz
7,172,716,779 cycles # 3.391 GHz
7,223,252,964 cycles # 3.391 GHz
nate@haswell$ N=6; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
6,234,770,361 cycles # 3.391 GHz
6,199,096,296 cycles # 3.391 GHz
6,213,348,126 cycles # 3.391 GHz
6,217,971,263 cycles # 3.391 GHz
6,224,779,686 cycles # 3.391 GHz
6,194,117,897 cycles # 3.391 GHz
6,225,259,274 cycles # 3.391 GHz
6,244,391,509 cycles # 3.391 GHz
6,189,972,381 cycles # 3.391 GHz
6,205,556,306 cycles # 3.391 GHz
Run Code Online (Sandbox Code Playgroud)
考虑到这一点,我们重新阅读了所有英特尔建筑手册中的每一个字,筛选整个网络上提到"计算机"或"编程"字样的每一页,并在山顶上独立思考6年.没有实现任何启示,我们归结为文明,刮胡子,洗个澡,并问StackOverflow的专家:
这可能发生什么?
编辑:在本杰明的帮助下(见下面的答案)我想出了一个更简洁的测试用例.这是一个独立的20行装配线.即使结果保持不变并且执行相同数量的指令,从使用SUB变为SBB也会导致性能差异达到15%.解释吗?我想我越来越接近一个.
; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
; nasm -felf64 func.asm
; ld func.o
; Then run:
; perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies
; ~15% depending on whether sub or sbb is used in the loop
section .text
global _start
_start:
push qword 0h ; put counter variable on stack
jmp loop ; jump to function
align 64 ; function alignment.
loop:
mov rcx, 1000000000
align 64 ; loop alignment.
l:
mov rax, [rsp]
add rax, 1h
mov [rsp], rax
; sbb rcx, 1h ; which is faster: sbb or sub?
sub rcx, 1h ; switch, time it, and find out
jne l ; (rot13 spoiler: foo vf snfgre ol 15%)
fin: ; If that was too easy, explain why.
mov eax, 60
xor edi, edi ; End of program. Exit with code 0
syscall
Run Code Online (Sandbox Code Playgroud)
查看完整的perf stat输出,您将看到它不是指令的数量,而是停止的周期数.
看看反汇编,我发现了两件事:
在我的机器上进行上述更改测试然后产生:
for f in $( seq 7); do perf stat -e cycles -r3 ./same-function $f 2>&1; done|grep cycles
6,070,933,420 cycles ( +- 0.11% )
6,052,771,142 cycles ( +- 0.06% )
6,099,676,333 cycles ( +- 0.07% )
6,092,962,697 cycles ( +- 0.16% )
6,151,861,993 cycles ( +- 0.69% )
6,074,323,033 cycles ( +- 0.36% )
6,174,434,653 cycles ( +- 0.65% )
不过,我不太了解你找到的摊位的性质.
编辑: 我在每个函数中都使用了一个volatile成员,在我的I7-3537U上测试了不同的编译,发现`-falign-loops = 64'实际上是最慢的:
$ gcc -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function
$ gcc -falign-loops=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-l64
$ gcc -falign-functions=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-f64
$ for prog in same-function{,-l64,-f64}; do echo $prog; for f in $(seq 7); do perf stat -e cycles -r10 ./$prog $f 2>&1; done|grep cycl; done
same-function
6,079,966,292 cycles ( +- 0.19% )
7,419,053,569 cycles ( +- 0.07% )
6,136,061,105 cycles ( +- 0.27% )
7,282,434,896 cycles ( +- 0.74% )
6,104,866,406 cycles ( +- 0.16% )
7,342,985,942 cycles ( +- 0.52% )
6,208,373,040 cycles ( +- 0.50% )
same-function-l64
7,336,838,175 cycles ( +- 0.46% )
7,358,913,923 cycles ( +- 0.52% )
7,412,570,515 cycles ( +- 0.38% )
7,435,048,756 cycles ( +- 0.10% )
7,404,834,458 cycles ( +- 0.34% )
7,291,095,582 cycles ( +- 0.99% )
7,312,052,598 cycles ( +- 0.95% )
same-function-f64
6,103,059,996 cycles ( +- 0.12% )
6,116,601,533 cycles ( +- 0.29% )
6,120,841,824 cycles ( +- 0.18% )
6,114,278,098 cycles ( +- 0.09% )
6,105,938,586 cycles ( +- 0.14% )
6,101,672,717 cycles ( +- 0.19% )
6,121,339,944 cycles ( +- 0.11% )
Run Code Online (Sandbox Code Playgroud)
有关align-loops与align-functions的更多详细信息:
$ for prog in same-function{-l64,-f64}; do sudo perf stat -d -r10 ./$prog 0; done
Performance counter stats for './same-function-l64 0' (10 runs):
2396.608194 task-clock:HG (msec) # 1.001 CPUs utilized ( +- 0.64% )
56 context-switches:HG # 0.024 K/sec ( +- 5.51% )
1 cpu-migrations:HG # 0.000 K/sec ( +- 74.78% )
46 page-faults:HG # 0.019 K/sec ( +- 0.63% )
7,331,450,530 cycles:HG # 3.059 GHz ( +- 0.51% ) [85.68%]
5,332,248,218 stalled-cycles-frontend:HG # 72.73% frontend cycles idle ( +- 0.71% ) [71.42%]
<not supported> stalled-cycles-backend:HG
5,000,800,933 instructions:HG # 0.68 insns per cycle
# 1.07 stalled cycles per insn ( +- 0.04% ) [85.73%]
1,000,446,303 branches:HG # 417.443 M/sec ( +- 0.04% ) [85.75%]
8,461 branch-misses:HG # 0.00% of all branches ( +- 6.05% ) [85.76%]
<not supported> L1-dcache-loads:HG
45,593 L1-dcache-load-misses:HG # 0.00% of all L1-dcache hits ( +- 3.61% ) [85.77%]
6,148 LLC-loads:HG # 0.003 M/sec ( +- 8.80% ) [71.36%]
<not supported> LLC-load-misses:HG
2.394456699 seconds time elapsed ( +- 0.64% )
Performance counter stats for './same-function-f64 0' (10 runs):
1998.936383 task-clock:HG (msec) # 1.001 CPUs utilized ( +- 0.61% )
60 context-switches:HG # 0.030 K/sec ( +- 17.77% )
1 cpu-migrations:HG # 0.001 K/sec ( +- 47.86% )
46 page-faults:HG # 0.023 K/sec ( +- 0.68% )
6,107,877,836 cycles:HG # 3.056 GHz ( +- 0.34% ) [85.63%]
4,112,602,649 stalled-cycles-frontend:HG # 67.33% frontend cycles idle ( +- 0.52% ) [71.41%]
<not supported> stalled-cycles-backend:HG
5,000,910,172 instructions:HG # 0.82 insns per cycle
# 0.82 stalled cycles per insn ( +- 0.01% ) [85.72%]
1,000,423,026 branches:HG # 500.478 M/sec ( +- 0.02% ) [85.77%]
10,660 branch-misses:HG # 0.00% of all branches ( +- 13.23% ) [85.80%]
<not supported> L1-dcache-loads:HG
47,492 L1-dcache-load-misses:HG # 0.00% of all L1-dcache hits ( +- 14.82% ) [85.80%]
11,719 LLC-loads:HG # 0.006 M/sec ( +- 42.44% ) [71.28%]
<not supported> LLC-load-misses:HG
1.997319759 seconds time elapsed ( +- 0.62% )
Run Code Online (Sandbox Code Playgroud)
两个可执行文件都有非常低的指令/循环计数,可能是由于循环的简约性质和它造成的内存压力,但我不知道为什么一个比另一个更差.
我也尝试过类似的东西
$ for prog in same-function{-l64,-f64}; do sudo perf stat -eL1-{d,i}cache-load-misses,L1-dcache-store-misses,cs,cycles,instructions -r10 ./$prog 0; done
Run Code Online (Sandbox Code Playgroud)
和
$ sudo perf record -F25000 -e'{cycles:pp,stalled-cycles-frontend}' ./same-function-l64 0
[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 6.771 MB perf.data (~295841 samples) ]
$ sudo perf report --group -Sloop0 -n --show-total-period --stdio
$ sudo perf annotate --group -sloop0 --stdio
Run Code Online (Sandbox Code Playgroud)
但找不到罪魁祸首没有任何成功.尽管如此,我觉得无论如何都要在这里记下它可能会有所帮助......
编辑2: 这是我对same-function.c的补丁:
$ git diff -u -U0
diff --git a/same-function.c b/same-function.c
index f78449e..78a5772 100644
--- a/same-function.c
+++ b/same-function.c
@@ -20 +20 @@ done
-volatile uint64_t counter = 0;
+//volatile uint64_t counter = 0;
@@ -22,0 +23 @@ COMPILER_NO_INLINE void loop0(void) {
+volatile uint64_t counter = 0;
@@ -31,0 +33 @@ COMPILER_NO_INLINE void loop1(void) {
+volatile uint64_t counter = 0;
@@ -40,0 +43 @@ COMPILER_NO_INLINE void loop2(void) {
+volatile uint64_t counter = 0;
@@ -49,0 +53 @@ COMPILER_NO_INLINE void loop3(void) {
+volatile uint64_t counter = 0;
@@ -58,0 +63 @@ COMPILER_NO_INLINE void loop4(void) {
+volatile uint64_t counter = 0;
@@ -67,0 +73 @@ COMPILER_NO_INLINE void loop5(void) {
+volatile uint64_t counter = 0;
@@ -76,0 +83 @@ COMPILER_NO_INLINE void loop6(void) {
+volatile uint64_t counter = 0;
@@ -85,0 +93 @@ COMPILER_NO_INLINE void loop7(void) {
+volatile uint64_t counter = 0;
@@ -94,0 +103 @@ COMPILER_NO_INLINE void loop8(void) {
+volatile uint64_t counter = 0;
@@ -103,0 +113 @@ COMPILER_NO_INLINE void loop9(void) {
+volatile uint64_t counter = 0;
@@ -135 +145 @@ int main(int argc, char** argv) {
-}
\ No newline at end of file
+}
Run Code Online (Sandbox Code Playgroud)
编辑3:更简单的例子也是如此:
; Minimal example, see also http://stackoverflow.com/q/26266953/3766665
; To build (Linux):
; nasm -felf64 func.asm
; ld func.o
; Then run:
; perf stat -r10 ./a.out
; Runtime varies ~10% depending on whether
section .text
global _start
_start:
push qword 0h ; put counter variable on stack
jmp loop ; jump to function
;align 64 ; function alignment. Try commenting this
loop:
mov rcx, 1000000000
;align 64 ; loop alignment. Try commenting this
l:
mov rax, [rsp]
add rax, 1h
mov [rsp], rax
sub rcx, 1h
jne l
fin: ; End of program. Exit with code 0
mov eax, 60
xor edi, edi
syscall
Run Code Online (Sandbox Code Playgroud)
这里效果相同.有趣.
干杯,本杰明
A few years ago, I would have told you to check for any difference in the internal state of the CPU when it arrives at any of those loops; as this was known to have a profound effect on the capacity of the Out-of-Order predictions process (or something like that). For example, the performance of the very same loop could change by up to something like 15-20% depending on what the CPU was doing just before entering the loop and just the fact of jumping from two different points could be sufficient to change the execution speed.
In your case, it's quite to test. All you have to do is to change the order of the instructions in the IF block; for example replacing the following:
switch (firstLetter) {
case '0': loop0(); break;
case '1': loop1(); break;
case '2': loop2(); break;
case '3': loop3(); break;
case '4': loop4(); break;
case '5': loop5(); break;
case '6': loop6(); break;
case '7': loop7(); break;
case '8': loop8(); break;
case '9': loop9(); break;
default: goto die_usage;
}
Run Code Online (Sandbox Code Playgroud)
with:
switch (firstLetter) {
case '9': loop9(); break;
case '8': loop8(); break;
case '7': loop7(); break;
case '6': loop6(); break;
case '5': loop5(); break;
case '4': loop4(); break;
case '3': loop3(); break;
case '2': loop2(); break;
case '1': loop1(); break;
case '0': loop0(); break;
default: goto die_usage;
}
Run Code Online (Sandbox Code Playgroud)
或以任何随机顺序。当然,您应该检查生成的汇编代码,以确保编译器没有重新排序指令的顺序。
另外,由于您的循环位于各个函数内;您还应该确保这些函数本身在 64 字节边界上对齐。