Meh*_*dad 23 c x86 assembly cpu-registers micro-optimization
I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming.
For the life of me, I cannot verify this in a single test case. Every test case I try debunks it.
For example, here's the code I'm compiling with Visual C++:
#include <limits.h>
#include <stdio.h>
#include <time.h>
int main(void)
{
    unsigned int k, l, j;
    clock_t tstart = clock();
    for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
    {
        ++k;
        k = j;  // <-- comment out this line to remove the MOV instruction
        l += j;
    }
    fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 / CLOCKS_PER_SEC));
    fflush(stderr);
    return (int)(k + j + l);
}
This produces the following assembly for the loop (feel free to generate it however you like; you obviously don't need Visual C++):
LOOP:
add edi,esi
mov ebx,esi
inc esi
cmp esi,FFFFFFFFh
jc LOOP
Now I ran this program a number of times, and I observe a pretty consistent 2% difference when the MOV instruction is removed:
Without MOV With MOV
1303 ms 1358 ms
1324 ms 1363 ms
1310 ms 1345 ms
1304 ms 1343 ms
1309 ms 1334 ms
1312 ms 1336 ms
1320 ms 1311 ms
1302 ms 1350 ms
1319 ms 1339 ms
1324 ms 1338 ms
So what gives? Why isn't MOV "free"? Is this loop too complicated for x86?
Is there a single example out there that can demonstrate MOV being free like people claim?
If so, what is it? And if not, why does everyone keep claiming MOV is free?
Pet*_*des 31
The throughput of the loop in the question doesn't depend on the latency of MOV, or (on Haswell) on the benefit of not using an execution unit.
The loop is still only 4 uops for the front-end to issue into the out-of-order core. (mov still has to be tracked by the out-of-order core even if it doesn't need an execution unit, but cmp/jc macro-fuses into a single uop.)
Intel CPUs since Core2 have had an issue width of 4 uops per clock, so the mov doesn't stop the loop from executing at (close to) one iteration per clock on Haswell. It would also run at one per clock on Ivybridge (with mov-elimination), but not on Sandybridge (no mov-elimination). On SnB it would be about one iteration per 1.333 cycles, bottlenecked on ALU throughput because the mov would always need one. (SnB/IvB have only three ALU ports, while Haswell has four.)
Note that special handling in the rename stage has been a thing for x87 FXCHG (swapping st0 with st1) for much longer than for MOV. Agner Fog lists FXCHG as 0 latency on PPro/PII/PIII (the first-gen P6 core).
The loop in the question has two interlocking dependency chains (add edi,esi depends on EDI and on the loop counter ESI), which makes it more sensitive to imperfect scheduling. A 2% slowdown versus the theoretical prediction because of seemingly unrelated instructions isn't unusual, and small variations in instruction order can make this kind of difference. To run at exactly 1c per iteration, every cycle needs to execute an INC and an ADD. Since all the INCs and ADDs depend on the previous iteration, out-of-order execution can't catch up by running two in a single cycle. Even worse, the ADD depends on the INC from the previous cycle, which is what I meant by "interlocking", so losing a cycle in the INC dep chain also stalls the ADD dep chain.
Also, predicted-taken branches can only run on port 6, so any cycle where port 6 doesn't execute a cmp/jc is a cycle of lost throughput. That happens every time an INC or ADD steals a cycle on port 6 instead of running on port 0, 1, or 5. IDK whether this is the culprit, or whether losing cycles in the INC/ADD dep chains themselves is the problem, or maybe some of both.
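To make the port picture concrete, here is a NASM-style rewrite of the question's loop, annotated per uop with the Haswell ports described above. The init/exit wrapper and the per-line annotations are my addition, so treat this as a sketch rather than the compiler's exact output:

global _start
_start:
    xor esi, esi                ; j = 0 (loop counter)
    xor edi, edi                ; l = 0 (accumulator)
.loop:
    add edi, esi                ; ALU uop, any of p0/p1/p5/p6; depends on EDI and on ESI from the previous iteration
    mov ebx, esi                ; eliminated at rename on IvB and later (no port needed); takes an ALU port on SnB
    inc esi                     ; ALU uop, any of p0/p1/p5/p6; 1c loop-carried dep chain
    cmp esi, 0xFFFFFFFF         ; cmp/jc macro-fuse into a single uop
    jc  .loop                   ; predicted-taken branch: port 6 only
    xor edi, edi                ; exit(0)
    mov eax, 231                ; __NR_exit_group
    syscall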
Adding the extra MOV doesn't add any execution-port pressure, assuming it is eliminated 100% of the time, but it does stop the front-end from running ahead of the execution core. (Only 3 of the 4 uops in the loop need an execution unit, and a Haswell CPU can run INC and ADD on any of its 4 ALU ports: 0, 1, 5 and 6.) So the bottlenecks are:
- the dependency chain through esi (INC latency of 1 cycle per iteration)
- the dependency chain through edi (ADD latency of 1 cycle per iteration, also dependent on the INC from the previous iteration)
Without the MOV, the front-end can issue the loop's three uops at 4 per clock until the out-of-order core is full. (AFAICT it "unrolls" tiny loops in the loop buffer (the Loop Stream Detector, LSD), so a loop with uops ABC can issue in an ABCA BCAB CABC ... pattern. The perf counter lsd.cycles_4_uops confirms that it mostly issues in groups of 4 whenever it issues any uops at all.)
Intel CPUs assign uops to ports as they issue into the out-of-order core. The decision is based on counters that track how many uops for each port are already in the scheduler (aka Reservation Station, RS). When there are lots of uops in the RS waiting to execute, this works well and should usually avoid scheduling an INC or ADD to port 6. And I guess it also avoids scheduling the INC and ADD in a way that loses time from either of those dep chains. But if the RS is empty or nearly empty, the counters don't stop an ADD or INC from stealing a cycle on port 6.
I thought I was onto something here, but any sub-optimal scheduling should just let the front-end catch up and keep the back-end full. I don't think we should expect the front-end to cause enough bubbles in the pipeline to explain a 2% drop below max throughput, since the tiny loop should run from the loop buffer at a very consistent 4-per-clock issue rate. Maybe there's something else going on.
Here's a real example of the benefit of mov-elimination. I used lea to construct a loop that only has one mov per clock, creating a perfect demonstration where mov-elimination succeeds 100% of the time, or 0% of the time with mov same,same to demonstrate the latency bottleneck that produces.
Since the macro-fused dec/jnz is part of the dependency chain involving the loop counter, imperfect scheduling can't delay it. This is different from the case where cmp/jc "forks off" from the critical-path dependency chain every iteration.
_start:
mov ecx, 2000000000 ; each iteration decrements by 2, so this is 1G iters
align 16 ; really align 32 makes more sense in case the uop-cache comes into play, but alignment is actually irrelevant for loops that fit in the loop buffer.
.loop:
mov eax, ecx
lea ecx, [rax-1] ; we vary these two instructions
dec ecx ; dec/jnz macro-fuses into one uop in the decoders, on Intel
jnz .loop
.end:
xor edi,edi ; edi=0
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
On Intel SnB-family, LEA with one or two components in the addressing mode runs with 1c latency (see http://agner.org/optimize/ and other links in the x86 tag wiki).
I built and ran this on Linux as a static binary, so user-space perf counters for the whole process are measuring just the loop, with negligible startup/shutdown overhead. (perf stat is really easy compared to putting perf-counter queries into the program itself.)
$ yasm -felf64 -Worphan-labels -gdwarf2 mov-elimination.asm && ld -o mov-elimination mov-elimination.o &&
objdump -Mintel -drwC mov-elimination &&
taskset -c 1 ocperf.py stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread -r2 ./mov-elimination
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: b9 00 94 35 77 mov ecx,0x77359400
4000b5: 66 66 2e 0f 1f 84 00 00 00 00 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
00000000004000c0 <_start.loop>:
4000c0: 89 c8 mov eax,ecx
4000c2: 8d 48 ff lea ecx,[rax-0x1]
4000c5: ff c9 dec ecx
4000c7: 75 f7 jne 4000c0 <_start.loop>
00000000004000c9 <_start.end>:
4000c9: 31 ff xor edi,edi
4000cb: b8 e7 00 00 00 mov eax,0xe7
4000d0: 0f 05 syscall
perf stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r2 ./mov-elimination
Performance counter stats for './mov-elimination' (2 runs):
513.242841 task-clock:u (msec) # 1.000 CPUs utilized ( +- 0.05% )
0 context-switches:u # 0.000 K/sec
1 page-faults:u # 0.002 K/sec
2,000,111,934 cycles:u # 3.897 GHz ( +- 0.00% )
4,000,000,161 instructions:u # 2.00 insn per cycle ( +- 0.00% )
1,000,000,157 branches:u # 1948.396 M/sec ( +- 0.00% )
3,000,058,589 uops_issued_any:u # 5845.300 M/sec ( +- 0.00% )
2,000,037,900 uops_executed_thread:u # 3896.865 M/sec ( +- 0.00% )
0.513402352 seconds time elapsed ( +- 0.05% )
As expected, the loop runs 1G times (branches ~= 1 billion). The "extra" 111k cycles beyond 2G is overhead that shows up in the other tests too, including the one with no mov. It's not from occasional failure of mov-elimination, but it does scale with the iteration count, so it's not just startup overhead. It's probably from timer interrupts, since IIRC Linux perf doesn't mess with the perf counters while handling interrupts, it just lets them keep counting. (perf virtualizes the hardware performance counters so you can get per-process counts even when a thread migrates across CPUs.) Also, timer interrupts on the sibling logical core sharing the same physical core will perturb things slightly.
The bottleneck is the loop-carried dependency chain involving the loop counter. 2G cycles for 1G iterations is 2 clocks per iteration, or 1 clock per decrement. This confirms that the dep chain is 2 cycles long, which is only possible if mov has zero latency. (I know it doesn't prove there isn't some other bottleneck. It really only proves that the latency is at most 2 cycles, if you don't believe my assertion that latency is the only bottleneck. There is a resource_stalls.any perf counter, but it doesn't have many options for breaking down which microarchitectural resource was exhausted.)
The loop has 3 fused-domain uops: mov, lea, and the macro-fused dec/jnz. The 3G uops_issued.any count confirms this: it counts in the fused domain, which is all of the pipeline from decoders to retirement, except for the scheduler (RS) and execution units. (Macro-fused instruction pairs stay as a single uop everywhere. It's only micro-fusion of stores or ALU+load where 1 fused-domain uop in the ROB tracks the progress of two unfused-domain uops.)
The 2G uops_executed.thread (unfused-domain) count tells us that all the mov uops were eliminated (i.e. handled by the issue/rename stage, and placed in the ROB in an already-executed state). They still take up issue/retire bandwidth, space in the uop cache, and code size. They take up space in the ROB, limiting the out-of-order window size. A mov instruction is never free. There are many possible microarchitectural bottlenecks besides latency and execution ports, the most important often being the 4-wide issue rate of the front-end.
On Intel CPUs, being zero latency is often a bigger deal than not needing an execution unit, especially in Haswell and later where there are 4 ALU ports. (But only 3 of them can handle vector uops, so non-eliminated vector moves would be a bottleneck more easily, especially in code without many loads or stores taking front-end bandwidth (4 fused-domain uops per clock) away from ALU uops. Also, scheduling uops to execution units isn't perfect (more like oldest-ready first), so uops that aren't on the critical path can steal cycles from the critical path.)
If we put a nop or an xor edx,edx into the loop, those would also issue but not execute on Intel SnB-family CPUs.
Zero-latency mov-elimination can be useful for zero-extending from 32 to 64 bits, and for 8 to 64. (movzx eax, bl is eliminated, movzx eax, bx isn't.)
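As a quick check of those two claims, here is a small loop in the same style as the one above. This variant and its expected counter ratios are my sketch, not something measured in this answer: if the xor-zeroing and the 8-bit movzx really don't need execution units, uops_executed should come out well below uops_issued, while the 16-bit movzx takes an ALU port every iteration.

global _start
_start:
    mov ecx, 1000000000         ; 1G iterations
    mov ebx, 7                  ; arbitrary source value for the zero-extends
.loop:
    xor   edx, edx              ; zeroing idiom: issues, but needs no execution unit on SnB-family
    movzx eax, bl               ; 8->32 zero-extend: candidate for mov-elimination
    movzx ebp, bx               ; 16->32 zero-extend: not eliminated, takes an ALU port
    dec ecx
    jnz .loop
    xor edi, edi                ; exit(0)
    mov eax, 231                ; __NR_exit_group
    syscall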
The same loop as before, but with a mov same,same that CPUs can't eliminate:
mov ecx, ecx # CPUs can't eliminate mov same,same
lea ecx, [rcx-1]
dec ecx
jnz .loop
3,000,320,972 cycles:u # 3.898 GHz ( +- 0.00% )
4,000,000,238 instructions:u # 1.33 insn per cycle ( +- 0.00% )
1,000,000,234 branches:u # 1299.225 M/sec ( +- 0.00% )
3,000,084,446 uops_issued_any:u # 3897.783 M/sec ( +- 0.00% )
3,000,058,661 uops_executed_thread:u # 3897.750 M/sec ( +- 0.00% )
This takes 3G cycles for 1G iterations, because the length of the dependency chain is now 3 cycles.
The fused-domain uop count didn't change, still 3G.
What did change is that now the unfused-domain uop count is the same as the fused-domain count. All the uops needed an execution unit; none of the mov same,same instructions were eliminated, so they all added 1c of latency to the loop-carried dep chain.
(When there are micro-fused uops, like add eax, [rsi], the uops_executed count can be higher than uops_issued. But we don't have any of those here.)
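For completeness, here is what such a case would look like; this is my sketch, not part of the measurements above. A memory-source ADD is one fused-domain uop but two unfused-domain uops (load + ALU), so uops_executed per iteration (about 3, counting the macro-fused dec/jnz) should end up above uops_issued (about 2).

section .text
global _start
_start:
    mov ecx, 1000000000
    mov esi, buf                ; static address of a 4-byte buffer (assumes a non-PIE static link)
.loop:
    add eax, [rsi]              ; micro-fused load+ALU: 1 fused-domain uop, 2 unfused-domain uops
    dec ecx
    jnz .loop
    xor edi, edi                ; exit(0)
    mov eax, 231
    syscall

section .data
buf: dd 1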
Without the mov at all:
lea ecx, [rcx-1]
dec ecx
jnz .loop
2,000,131,323 cycles:u # 3.896 GHz ( +- 0.00% )
3,000,000,161 instructions:u # 1.50 insn per cycle
1,000,000,157 branches:u # 1947.876 M/sec
2,000,055,428 uops_issued_any:u # 3895.859 M/sec ( +- 0.00% )
2,000,039,061 uops_executed_thread:u # 3895.828 M/sec ( +- 0.00% )
Now we're back down to 2 cycle latency for the loop-carried dep chain.
Nothing is eliminated.
I tested on a 3.9GHz i7-6700k Skylake. I get identical results on a Haswell i5-4210U (to within 40k out of 1G counts) for all the perf events. That's about the same margin of error as re-running on the same system.
Note that if I run perf as root, and count cycles instead of cycles:u (user-space only), it measures the CPU frequency as exactly 3.900 GHz. (IDK why Linux only obeys the BIOS settings for max turbo right after reboot, but then drops to 3.9 GHz if I leave it idle for a couple of minutes. Asus Z170 Pro Gaming mobo, Arch Linux with kernel 4.10.11-1-ARCH. Saw the same thing with Ubuntu. Writing balance_performance to each of /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference from /etc/rc.local fixes it, but writing balance_power makes it drop back to 3.9 GHz again later.)
You should get the same results on AMD Ryzen, since it can eliminate integer mov. AMD Bulldozer-family can only eliminate xmm register copies. (According to Agner Fog, ymm register copies are an eliminated low half plus an ALU op for the high half.)
For example, AMD Bulldozer and Intel Ivybridge can sustain a throughput of 1 per clock for:
movaps xmm0, xmm1
movaps xmm2, xmm3
movaps xmm4, xmm5
dec ecx
jnz .loop
But Intel Sandybridge can't eliminate moves, so it would bottleneck at 4 ALU uops for 3 ALU execution ports. If it was pxor xmm0,xmm0 instead of movaps, SnB could also sustain one iteration per clock. (But Bulldozer-family couldn't, because xor-zeroing still needs an execution unit on AMD, even though it's independent of the old value of the register. And Bulldozer-family only has 0.5c throughput for PXOR.)
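A sketch of that substitution (mine, following the claim above): with xor-zeroing uops in place of the register copies, the loop body needs no vector ALU ports on Intel SnB-family, since the zeroing idiom is handled at rename, so even Sandybridge should manage one iteration per clock here.

global _start
_start:
    mov ecx, 1000000000
.loop:
    pxor xmm0, xmm0             ; zeroing idiom: no execution unit needed on Intel SnB-family
    pxor xmm2, xmm2
    pxor xmm4, xmm4
    dec ecx
    jnz .loop
    xor edi, edi                ; exit(0)
    mov eax, 231
    syscall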
Two dependent MOV instructions in a row expose a difference between Haswell and Skylake.
.loop:
mov eax, ecx
mov ecx, eax
sub ecx, 2
jnz .loop
Haswell: minor run-to-run variability (1.746 to 1.749 c/iter), but this is typical:
1,749,102,925 cycles:u # 2.690 GHz
4,000,000,212 instructions:u # 2.29 insn per cycle
1,000,000,208 branches:u # 1538.062 M/sec
3,000,079,561 uops_issued_any:u # 4614.308 M/sec
1,746,698,502 uops_executed_core:u # 2686.531 M/sec
745,676,067 lsd_cycles_4_uops:u # 1146.896 M/sec
Not all the MOV instructions are eliminated: about 0.75 of the 2 per iteration used an execution port. Every MOV that executes instead of being eliminated adds 1c of latency to the loop-carried dep chain, so it's not a coincidence that cycles and uops_executed are very similar. All the uops are part of a single dependency chain, so there's no parallelism possible. cycles is always about 5M higher than uops_executed regardless of run-to-run variation, so I guess there are just 5M cycles being used up somewhere else.
Skylake: more stable than HSW results, and more mov-elimination: only 0.6666 MOVs of every 2 needed an execution unit.
1,666,716,605 cycles:u # 3.897 GHz
4,000,000,136 instructions:u # 2.40 insn per cycle
1,000,000,132 branches:u # 2338.050 M/sec
3,000,059,008 uops_issued_any:u # 7014.288 M/sec
1,666,548,206 uops_executed_thread:u # 3896.473 M/sec
666,683,358 lsd_cycles_4_uops:u # 1558.739 M/sec
On Haswell, lsd.cycles_4_uops accounted for all of the uops (0.745 * 4 ~= 3). So in almost every cycle where any uops are issued, a full group of 4 is issued (from the loop buffer; I probably should have looked at a different counter that doesn't care where they came from, like uops_issued.stall_cycles, to count cycles where no uops issued).
But on SKL, 0.66666 * 4 = 2.66664 is less than 3, so in some cycles the front-end issued fewer than 4 uops. (Usually it stalls until there's room in the out-of-order core to issue a full group of 4, instead of issuing non-full groups.)
It's weird, IDK what the exact microarchitectural limitation is. Since the loop is only 3 uops, each issue-group of 4 uops is more than a full iteration. So an issue group can contain up to 3 dependent MOVs. Perhaps Skylake is designed to break that up sometimes, to allow more mov-elimination?
update: actually this is normal for 3-uop loops on Skylake. uops_issued.stall_cycles shows that HSW and SKL issue a simple 3-uop loop with no mov-elimination the same way they issue this one. So better mov-elimination is a side effect of splitting up issue groups for some other reason. (It's not a bottleneck, because taken branches can't execute faster than 1 per clock regardless of how fast they issue.) I still don't know why SKL is different, but I don't think it's anything to worry about.
In a less extreme case, SKL and HSW are the same, with both failing to eliminate 0.3333 of every 2 MOV instructions:
.loop:
mov eax, ecx
dec eax
mov ecx, eax
sub ecx, 1
jnz .loop
2,333,434,710 cycles:u # 3.897 GHz
5,000,000,185 instructions:u # 2.14 insn per cycle
1,000,000,181 branches:u # 1669.905 M/sec
4,000,061,152 uops_issued_any:u # 6679.720 M/sec
2,333,374,781 uops_executed_thread:u # 3896.513 M/sec
1,000,000,942 lsd_cycles_4_uops:u # 1669.906 M/sec
All of the uops issue in groups of 4. Any contiguous group of 4 uops will contain exactly two MOV uops that are candidates for elimination. Since it clearly succeeds at eliminating both in some cycles, IDK why it can't always do that.
Intel's optimization manual says that overwriting the result of mov-elimination as early as possible frees up the microarchitectural resources so it can succeed more often, at least for movzx.
. See Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions.
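A hypothetical illustration of that advice (my sketch of the idea, not the manual's Example 3-25 itself): if the register written by a possibly-eliminated mov is consumed and overwritten again soon, whatever internal resource tracks the two architectural registers sharing one physical register is released sooner, so later eliminations are more likely to succeed.

global _start
_start:
    mov ecx, 1000000000
    xor eax, eax
.loop:
    mov ebx, eax                ; candidate for mov-elimination: eax and ebx now share a physical register
    add ebx, 1                  ; overwrite ebx right away, releasing that sharing as early as possible
    add eax, 2                  ; independent work on eax
    add eax, 3
    dec ecx
    jnz .loop
    ; the "unfriendly" ordering would be: mov ebx,eax / add eax,2 / add eax,3 / add ebx,1,
    ; which keeps eax and ebx linked across the whole block.
    xor edi, edi                ; exit(0)
    mov eax, 231
    syscall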
So maybe it's tracked internally with a limited-size table of ref-counts? Something has to stop the physical register file entry from being freed when it's no longer needed as the value of the original architectural register, if it's still needed as the value of the mov destination. Freeing PRF entries as soon as possible is key, because PRF size can limit the out-of-order window to smaller than ROB size.
I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.
See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for more research + guesswork about how mov-elimination works, and whether it could work for xchg. (In practice xchg reg,reg is 3 ALU uops on Intel, but 2 eliminated uops on Ryzen. It's interesting to guess whether Intel could have implemented it more efficiently.)
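A small loop one could use to check that claim with the same perf setup as above (my sketch): on Intel, uops_executed should come out around 4 per iteration (3 for the xchg plus the macro-fused dec/jnz), while on Ryzen the xchg uops should be eliminated and mostly not show up in the executed count.

global _start
_start:
    mov ecx, 1000000000
    mov eax, 1
    mov ebx, 2
.loop:
    xchg eax, ebx               ; 3 ALU uops on Intel, 2 (eliminated) uops on Ryzen, per the text above
    dec ecx
    jnz .loop
    xor edi, edi                ; exit(0)
    mov eax, 231
    syscall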
BTW, as a workaround for an erratum on Haswell, Linux doesn't provide uops_executed.thread when hyperthreading is enabled, only uops_executed.core. The other core was definitely idle the whole time, not even timer interrupts, because I took it offline (via /sys/devices/system/cpu/cpu*/online). Unfortunately this can't be done before perf decides whether HT is enabled, and my Dell laptop doesn't have a BIOS option to disable HT. So I can't get perf to use all 8 hardware PMU counters at once on that system, only 4. :/
har*_*old 11
Here are two small benchmarks that I believe conclusively show evidence of mov-elimination:
__loop1:
add edx, 1
add edx, 1
add ecx, 1
jnc __loop1
versus
__loop2:
mov eax, edx
add eax, 1
mov edx, eax
add edx, 1
add ecx, 1
jnc __loop2
If mov added a cycle to the dependency chain, the second version would be expected to take about 4 cycles per iteration. On my Haswell, both take about 2 cycles per iteration, which can't happen without mov-elimination.
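For reference, here is how one might wrap the second loop into a standalone binary using the same yasm/ld/perf workflow as the accepted answer; the wrapper and initial register values are mine. add ecx,1 only sets the carry flag when ecx wraps around, so starting from zero the loop runs 2^32 iterations.

global _start
_start:
    xor ecx, ecx                ; carry out of add ecx,1 only after ecx wraps: 2^32 iterations
    xor edx, edx
__loop2:
    mov eax, edx
    add eax, 1
    mov edx, eax
    add edx, 1
    add ecx, 1
    jnc __loop2
    xor edi, edi                ; exit(0)
    mov eax, 231                ; __NR_exit_group
    syscall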