Can x86's MOV really be "free"? Why can't I reproduce this at all?

Meh*_*dad 23 c x86 assembly cpu-registers micro-optimization

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming.

For the life of me, I cannot verify this in a single test case. Every test case I try debunks it.

For example, here's the code I'm compiling with Visual C++:

#include <limits.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    unsigned int k, l, j;
    clock_t tstart = clock();
    for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
    {
        ++k;
        k = j;     // <-- comment out this line to remove the MOV instruction
        l += j;
    }
    fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 / CLOCKS_PER_SEC));
    fflush(stderr);
    return (int)(k + j + l);
}

This produces the following assembly code for the loop (feel free to produce it however you want; you obviously don't need Visual C++):

LOOP:
    add edi,esi
    mov ebx,esi
    inc esi
    cmp esi,FFFFFFFFh
    jc  LOOP

Now I run this program a number of times, and I observe a pretty consistent 2% difference when the MOV instruction is removed:

Without MOV      With MOV
  1303 ms         1358 ms
  1324 ms         1363 ms
  1310 ms         1345 ms
  1304 ms         1343 ms
  1309 ms         1334 ms
  1312 ms         1336 ms
  1320 ms         1311 ms
  1302 ms         1350 ms
  1319 ms         1339 ms
  1324 ms         1338 ms

So what gives? Why isn't MOV "free"? Is this loop too complicated for x86?
Is there a single example out there that can demonstrate MOV being free, like people claim?
If so, what is it? And if not, why does everyone claim MOV is free?

Pet*_*des 31

The throughput of the loop in the question doesn't depend on the latency of MOV, or (on Haswell) on the benefit of not using an execution unit.

The loop is still only 4 uops for the front-end to issue into the out-of-order core. (The mov still has to be tracked by the out-of-order core even if it doesn't need an execution unit, but cmp/jc macro-fuses into a single uop.)

Intel CPUs since Core 2 have had an issue width of 4 uops per clock, so the mov doesn't stop the loop from executing at (close to) one iteration per clock on Haswell. It would also run at one per clock on Ivybridge (with mov-elimination), but not on Sandybridge (no mov-elimination). On SnB, it would be about one iteration per 1.333 cycles, bottlenecked on ALU throughput because the mov would always need one. (SnB/IvB have only three ALU ports, while Haswell has four.)
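
As a quick back-of-the-envelope for that SnB number (arithmetic only, not a measurement):

    ; SnB without mov-elimination: every uop in the loop needs one of 3 ports (0, 1, 5)
    ;   mov ebx,esi    1 ALU uop
    ;   inc esi        1 ALU uop
    ;   add edi,esi    1 ALU uop
    ;   cmp/jc         1 uop (macro-fused; a predicted-taken branch still occupies port 5)
    ; 4 port-bound uops / 3 ports = ~1.333 cycles per iteration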

Note that special handling in the rename stage has been a thing for x87's FXCHG (swapping st0 with st1) for much longer than for MOV. Agner Fog lists FXCHG as 0 latency on PPro/PII/PIII (the first-gen P6 core).


The loop in the question has two interlocking dependency chains (add edi,esi depends on EDI and on the loop counter ESI), which makes it more sensitive to imperfect scheduling. A 2% slowdown versus theoretical prediction because of seemingly-unrelated instructions isn't unusual, and small variations in instruction order can make this kind of difference. To run at exactly 1c per iteration, every cycle needs to run an INC and an ADD. Since all the INCs and ADDs depend on the previous iteration, out-of-order execution can't catch up by running two in a single cycle. Even worse, the ADD depends on the INC from the previous cycle, which is what I meant by "interlocking", so losing a cycle in the INC dep chain also stalls the ADD dep chain.
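
One way to picture the two chains, per iteration n:

    ; esi chain:  inc esi       esi[n] = esi[n-1] + 1          (1c per step)
    ; edi chain:  add edi,esi   edi[n] = edi[n-1] + esi[n-1]   (1c per step, fed by the esi chain)
    ; a cycle lost in the esi chain also delays the input to the edi chain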

Also, predicted-taken branches can only run on port 6, so any cycle where port 6 doesn't execute a cmp/jc is a cycle of lost throughput. This happens every time an INC or ADD steals a cycle on port 6 instead of running on port 0, 1, or 5. IDK if this is the culprit, or if losing cycles in the INC/ADD dep chains themselves is the problem, or maybe some of both.

Adding the extra MOV doesn't add any execution-port pressure, assuming it's eliminated 100% of the time, but it does stop the front-end from running ahead of the execution core. (Only 3 of the 4 uops in the loop need an execution unit, and your Haswell CPU can run INC and ADD on any of its 4 ALU ports: 0, 1, 5, and 6.) So the bottlenecks are:

  • the front-end max throughput of 4 uops per clock. (The loop without MOV is only 3 uops, so the front-end can run ahead.)
  • taken-branch throughput of one per clock.
  • the dependency chain involving esi (INC latency of 1 per clock)
  • the dependency chain involving edi (ADD latency of 1 per clock, also dependent on the INC from the previous iteration)

Without the MOV, the front-end can issue the loop's three uops at 4 per clock until the out-of-order core is full. (AFAICT, it "unrolls" tiny loops in the loop buffer (Loop Stream Detector: LSD), so a loop with uops ABC can issue in an ABCA BCAB CABC ... pattern. The perf counter for lsd.cycles_4_uops confirms that it mostly issues in groups of 4 when it issues any uops at all.)
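
To make that pattern concrete, with the question's three remaining uops labeled A = add, B = inc, C = cmp/jc:

    ; cycle 1: A B C A
    ; cycle 2: B C A B
    ; cycle 3: C A B C
    ; lsd.cycles_4_uops counts cycles in which a full group of 4 issued from the LSD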

Intel CPUs assign uops to ports as they issue into the out-of-order core. The decision is based on counters that track how many uops for each port are already in the scheduler (aka Reservation Station, RS). When there are lots of uops in the RS waiting to execute, this works well and should usually avoid scheduling INC or ADD to port 6. And I guess it also avoids scheduling the INC and ADD in a way that loses time from either of those dep chains. But if the RS is empty or nearly empty, the counters don't stop an ADD or INC from stealing a cycle on port 6.

I thought I was onto something here, but any sub-optimal scheduling should just let the front-end catch up and keep the back-end full. I don't think we should expect the front-end to cause enough bubbles in the pipeline to explain a 2% drop below max throughput, since the tiny loop should run from the loop buffer at a very consistent 4-per-clock throughput. Maybe something else is going on.


A real example of the benefit of mov-elimination

I used lea to construct a loop that only has one mov per clock, creating a perfect demonstration where MOV-elimination succeeds 100% of the time, or 0% of the time with mov same,same to demonstrate the latency bottleneck that produces.

Since the macro-fused dec/jnz is part of the dependency chain involving the loop counter, imperfect scheduling can't delay it. This is different from the case where cmp/jc "forks off" from the critical-path dependency chain every iteration.

_start:
    mov     ecx, 2000000000 ; each iteration decrements by 2, so this is 1G iters
align 16  ; really align 32 makes more sense in case the uop-cache comes into play, but alignment is actually irrelevant for loops that fit in the loop buffer.
.loop:
    mov eax, ecx
    lea ecx, [rax-1]    ; we vary these two instructions

    dec ecx             ; dec/jnz macro-fuses into one uop in the decoders, on Intel
    jnz .loop

.end:
    xor edi,edi    ; edi=0
    mov eax,231    ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall        ; sys_exit_group(0)

On Intel SnB-family CPUs, LEA with one or two components in the addressing mode runs with 1c latency (see http://agner.org/optimize/ and other links in the x86 tag wiki).

I built and ran this as a static binary on Linux, so user-space perf counters for the whole process measure just the loop, with negligible startup/shutdown overhead. (perf stat is really easy compared to putting perf-counter queries into the program itself.)

$ yasm -felf64 -Worphan-labels -gdwarf2 mov-elimination.asm && ld -o mov-elimination mov-elimination.o &&
  objdump -Mintel -drwC mov-elimination &&
  taskset -c 1 ocperf.py stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread  -r2 ./mov-elimination

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:       b9 00 94 35 77          mov    ecx,0x77359400
  4000b5:       66 66 2e 0f 1f 84 00 00 00 00 00        data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000c0 <_start.loop>:
  4000c0:       89 c8                   mov    eax,ecx
  4000c2:       8d 48 ff                lea    ecx,[rax-0x1]
  4000c5:       ff c9                   dec    ecx
  4000c7:       75 f7                   jne    4000c0 <_start.loop>

00000000004000c9 <_start.end>:
  4000c9:       31 ff                   xor    edi,edi
  4000cb:       b8 e7 00 00 00          mov    eax,0xe7
  4000d0:       0f 05                   syscall 

perf stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r2 ./mov-elimination

 Performance counter stats for './mov-elimination' (2 runs):

    513.242841      task-clock:u (msec)       #    1.000 CPUs utilized    ( +-  0.05% )
             0      context-switches:u        #    0.000 K/sec                  
             1      page-faults:u             #    0.002 K/sec                  
 2,000,111,934      cycles:u                  #    3.897 GHz              ( +-  0.00% )
 4,000,000,161      instructions:u            #    2.00  insn per cycle   ( +-  0.00% )
 1,000,000,157      branches:u                # 1948.396 M/sec            ( +-  0.00% )
 3,000,058,589      uops_issued_any:u         # 5845.300 M/sec            ( +-  0.00% )
 2,000,037,900      uops_executed_thread:u    # 3896.865 M/sec            ( +-  0.00% )

   0.513402352 seconds time elapsed                                          ( +-  0.05% )

As expected, the loop runs 1G times (branches ~= 1 billion). The "extra" 111k cycles beyond 2G is overhead that shows up in the other tests too, including the one with no mov. It's not from occasional failure of mov-elimination, but it does scale with the iteration count, so it's not just startup overhead. It probably comes from timer interrupts, since IIRC Linux perf doesn't mess with the perf counters while handling interrupts, and just lets them keep counting. (perf virtualizes the hardware performance counters so you can get per-process counts even when a thread migrates across CPUs.) Also, timer interrupts on a sibling logical core sharing the same physical core would perturb things a bit.

The bottleneck is the loop-carried dependency chain involving the loop counter. 2G cycles for 1G iterations is 2 clocks per iteration, or 1 clock per decrement. This confirms that the length of the dep chain is 2 cycles. That's only possible if the mov has zero latency. (I know it doesn't prove there isn't some other bottleneck; if you don't believe my assertion that latency is the only bottleneck, it really only proves the latency is at most 2 cycles. There is a resource_stalls.any perf counter, but it doesn't have many options for breaking down which microarchitectural resource was exhausted.)
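
Spelling out the per-iteration critical path:

    ; mov eax, ecx        0c (eliminated at rename)
    ; lea ecx, [rax-1]    1c
    ; dec ecx / jnz       1c (macro-fused)
    ; total: 2c per iteration -> 2G cycles for 1G iterations, matching the count above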

The loop has 3 fused-domain uops: mov, lea, and the macro-fused dec/jnz. The 3G uops_issued.any count confirms that: it counts in the fused domain, which covers the whole pipeline from decoders to retirement, except for the scheduler (RS) and execution units. (Macro-fused instruction pairs stay as a single uop everywhere. It's only with micro-fusion of stores or of ALU + load that 1 fused-domain uop in the ROB tracks the progress of two unfused-domain uops.)
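
Tallying the two domains against the loop body:

    fused domain   (uops_issued.any):       mov + lea + dec/jnz = 3 per iter -> 3G total
    unfused domain (uops_executed.thread):  lea + dec/jnz       = 2 per iter -> 2G total
    (the eliminated mov is issued and retired, but never executed)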

2G uops_executed.thread (unfused domain) tells us that all the mov uops were eliminated (i.e. handled by the issue/rename stage, and placed in the ROB in an already-executed state). They still take up issue/retire bandwidth, space in the uop cache, and code size. They take up space in the ROB, limiting out-of-order window size. A mov instruction is never free. There are many microarchitectural bottlenecks possible besides latency and execution ports, the most important often being the 4-wide issue rate of the front-end.

On Intel CPUs, being zero latency is often a bigger deal than not needing an execution unit, especially in Haswell and later where there are 4 ALU ports. (But only 3 of them can handle vector uops, so non-eliminated vector moves would be a bottleneck more easily, especially in code without many loads or stores taking front-end bandwidth (4 fused-domain uops per clock) away from ALU uops. Also, scheduling uops to execution units isn't perfect (more like oldest-ready first), so uops that aren't on the critical path can steal cycles from the critical path.)

If we put a nop or an xor edx,edx into the loop, those would also issue but not execute on Intel SnB-family CPUs.

Zero-latency mov-elimination can be useful for zero-extending from 32 to 64 bits, and for 8 to 64. (movzx eax, bl is eliminated, movzx eax, bx isn't).
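
A minimal way to check that movzx claim with the same methodology would be a variant of the loop above (an untested sketch):

    .loop:
        movzx eax, cl       ; 8 -> 32/64 zero-extend: candidate for elimination
        lea   ecx, [rax-1]
        dec   ecx
        jnz   .loop
    ; expect ~2c/iter if the movzx is eliminated, ~3c/iter if it needs an ALU uop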


Without mov-elimination

mov ecx, ecx      # CPUs can't eliminate  mov same,same
lea ecx, [rcx-1]

dec ecx
jnz .loop

 3,000,320,972      cycles:u                  #    3.898 GHz                      ( +-  0.00% )
 4,000,000,238      instructions:u            #    1.33  insn per cycle           ( +-  0.00% )
 1,000,000,234      branches:u                # 1299.225 M/sec                    ( +-  0.00% )
 3,000,084,446      uops_issued_any:u         # 3897.783 M/sec                    ( +-  0.00% )
 3,000,058,661      uops_executed_thread:u    # 3897.750 M/sec                    ( +-  0.00% )

This takes 3G cycles for 1G iterations, because the length of the dependency chain is now 3 cycles.
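
The critical path, annotated the same way as before:

    ; mov ecx, ecx        1c (not eliminated)
    ; lea ecx, [rcx-1]    1c
    ; dec ecx / jnz       1c (macro-fused)
    ; total: 3c per iteration -> 3G cycles for 1G iterations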

The fused-domain uop count didn't change, still 3G.

What did change is that now the unfused-domain uop count is the same as fused-domain. All the uops needed an execution unit; none of the mov same,same instructions were eliminated, so they all added 1c latency to the loop-carried dep chain.

(When there are micro-fused uops, like add eax, [rsi], the uops_executed count can be higher than uops_issued. But we don't have that here.)


Without the mov at all:

lea ecx, [rcx-1]

dec ecx
jnz .loop


 2,000,131,323      cycles:u                  #    3.896 GHz                      ( +-  0.00% )
 3,000,000,161      instructions:u            #    1.50  insn per cycle         
 1,000,000,157      branches:u                # 1947.876 M/sec                  
 2,000,055,428      uops_issued_any:u         # 3895.859 M/sec                    ( +-  0.00% )
 2,000,039,061      uops_executed_thread:u    # 3895.828 M/sec                    ( +-  0.00% )

Now we're back down to 2 cycle latency for the loop-carried dep chain.

Nothing is eliminated.


I tested on a 3.9GHz i7-6700k Skylake. I get identical results on a Haswell i5-4210U (to within 40k out of 1G counts) for all the perf events. That's about the same margin of error as re-running on the same system.

Note that if I run perf as root, and count cycles instead of cycles:u (user-space only), it measures the CPU frequency as exactly 3.900 GHz. (IDK why Linux only obeys the BIOS settings for max turbo right after reboot, but then drops to 3.9GHz if I leave it idle for a couple of minutes. Asus Z170 Pro Gaming mobo, Arch Linux with kernel 4.10.11-1-ARCH. Saw the same thing with Ubuntu. Writing balance_performance to each of /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference from /etc/rc.local fixes it, but writing balance_power makes it drop back to 3.9GHz again later.)


You should get the same results on AMD Ryzen, since it can eliminate integer mov. AMD Bulldozer-family can only eliminate xmm register copies. (According to Agner Fog, ymm register copies are an eliminated low half plus an ALU op for the high half.)

For example, AMD Bulldozer and Intel Ivybridge can sustain a throughput of 1 per clock for

 movaps  xmm0, xmm1
 movaps  xmm2, xmm3
 movaps  xmm4, xmm5
 dec     eax
 jnz .loop

But Intel Sandybridge can't eliminate moves, so it would bottleneck on 4 ALU uops for 3 execution ports. If it was pxor xmm0,xmm0 instead of movaps, SnB could also sustain one iteration per clock. (But Bulldozer-family couldn't, because xor-zeroing still needs an execution unit on AMD, even though it's independent of the old value of the register. And Bulldozer-family only has 0.5c throughput for PXOR.)
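
For reference, the xor-zeroing substitution being described would look like this sketch:

 pxor    xmm0, xmm0     ; zeroing idiom: handled at issue/rename on SnB, no ALU uop
 pxor    xmm2, xmm2
 pxor    xmm4, xmm4
 dec     eax
 jnz     .loop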


Limitations of mov-elimination

Two dependent MOV instructions in a row expose a difference between Haswell and Skylake.

.loop:
  mov eax, ecx
  mov ecx, eax

  sub ecx, 2
  jnz .loop

Haswell: minor run-to-run variability (1.746 to 1.749 c/iter), but this is typical:

 1,749,102,925      cycles:u                  #    2.690 GHz                    
 4,000,000,212      instructions:u            #    2.29  insn per cycle         
 1,000,000,208      branches:u                # 1538.062 M/sec                  
 3,000,079,561      uops_issued_any:u         # 4614.308 M/sec                  
 1,746,698,502      uops_executed_core:u      # 2686.531 M/sec                  
   745,676,067      lsd_cycles_4_uops:u       # 1146.896 M/sec                  

Not all the MOV instructions are eliminated: about 0.75 of the 2 per iteration used an execution port. Every MOV that executes instead of being eliminated adds 1c of latency to the loop-carried dep chain, so it's not a coincidence that uops_executed and cycles are very similar. All the uops are part of a single dependency chain, so there's no parallelism possible. cycles is always about 5M higher than uops_executed regardless of run-to-run variation, so I guess there are just 5M cycles being used up somewhere else.
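
Tallying that out, per iteration on HSW:

    uops executed ~= 1 (sub/jnz) + 0.747 (MOVs that weren't eliminated)  ~= 1.747
    cycles        ~= 1 (sub)     + 0.749 (1c per MOV that executed)      ~= 1.749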

Skylake: more stable than the HSW results, and more mov-elimination: only 0.6666 of every 2 MOVs needed an execution unit.

 1,666,716,605      cycles:u                  #    3.897 GHz
 4,000,000,136      instructions:u            #    2.40  insn per cycle
 1,000,000,132      branches:u                # 2338.050 M/sec
 3,000,059,008      uops_issued_any:u         # 7014.288 M/sec
 1,666,548,206      uops_executed_thread:u    # 3896.473 M/sec
   666,683,358      lsd_cycles_4_uops:u       # 1558.739 M/sec

On Haswell, lsd.cycles_4_uops accounted for all of the uops. (0.745 * 4 ~= 3). So in almost every cycle where any uops are issued, a full group of 4 is issued (from the loop buffer; I probably should have looked at a different counter that doesn't care where they came from, like uops_issued.stall_cycles, to count cycles where no uops issued).

But on SKL, 0.66666 * 4 = 2.66664 is less than 3, so in some cycles the front-end issued fewer than 4 uops. (Usually it stalls until there's room in the out-of-order core to issue a full group of 4, instead of issuing non-full groups.)

It's weird, IDK what the exact microarchitectural limitation is. Since the loop is only 3 uops, each issue-group of 4 uops is more than a full iteration. So an issue group can contain up to 3 dependent MOVs. Perhaps Skylake is designed to break that up sometimes, to allow more mov-elimination?

Update: actually this is normal for 3-uop loops on Skylake. lsd.cycles_4_uops shows that HSW and SKL issue a simple 3-uop loop with no mov-elimination the same way they issue this one. So better mov-elimination is a side-effect of splitting up issue groups for some other reason. (It's not a bottleneck, because taken branches can't execute faster than 1 per clock regardless of how fast they issue.) I still don't know why SKL is different, but I don't think it's anything to worry about.


In a less extreme case, SKL and HSW are the same, with both failing to eliminate 0.3333 of every 2 MOV instructions:

.loop:
  mov eax, ecx
  dec eax
  mov ecx, eax

  sub ecx, 1
  jnz .loop

 2,333,434,710      cycles:u                  #    3.897 GHz                    
 5,000,000,185      instructions:u            #    2.14  insn per cycle         
 1,000,000,181      branches:u                # 1669.905 M/sec                  
 4,000,061,152      uops_issued_any:u         # 6679.720 M/sec                  
 2,333,374,781      uops_executed_thread:u    # 3896.513 M/sec                  
 1,000,000,942      lsd_cycles_4_uops:u       # 1669.906 M/sec                  

All of the uops issue in groups of 4. Any contiguous group of 4 uops will contain exactly two MOV uops that are candidates for elimination. Since it clearly succeeds at eliminating both in some cycles, IDK why it can't always do that.


Intel's optimization manual says that overwriting the result of mov-elimination as early as possible frees up the microarchitectural resources so it can succeed more often, at least for movzx. See Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions.

So maybe it's tracked internally with a limited-size table of ref-counts? Something has to stop the physical register file entry from being freed when it's no longer needed as the value of the original architectural register, if it's still needed as the value of the mov destination. Freeing PRF entries as soon as possible is key, because PRF size can limit the out-of-order window to smaller than ROB size.
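
Purely as an illustration of that guess, here is a toy model in C; the ref-count mechanism, names, and sizes are all speculative, not Intel's documented design:

/* Toy model of mov-elimination via ref-counted PRF entries. Entirely
 * hypothetical: it illustrates the guess above, not real hardware. */
#include <stdio.h>

#define PRF_SIZE 168                    /* made-up physical register file size */

static int refcount[PRF_SIZE];          /* arch registers referencing each entry */

/* Re-point an architectural register (held as a PRF index) at a new entry.
 * An eliminated mov is just remap(&dst, src_entry): both registers then
 * share one physical entry, and no execution unit is involved. */
static void remap(int *reg, int new_entry)
{
    refcount[new_entry]++;
    if (--refcount[*reg] == 0) {
        /* old entry is free to recycle; overwriting the result of an
         * eliminated mov sooner reaches this point sooner (cf. Example 3-25) */
    }
    *reg = new_entry;
}

int main(void)
{
    int eax = 0, ecx = 1;               /* arch regs -> PRF entries 0 and 1 */
    refcount[0] = refcount[1] = 1;

    remap(&eax, ecx);                   /* "mov eax, ecx" eliminated: share entry 1 */
    printf("refs on shared entry: %d\n", refcount[1]);   /* 2 */

    remap(&eax, 2);                     /* overwrite eax: the alias is released early */
    printf("refs on shared entry: %d\n", refcount[1]);   /* 1 */
    return 0;
}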

I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.

See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for more research + guesswork about how mov-elimination works, and whether it could work for xchg eax, ecx. (In practice xchg reg,reg is 3 ALU uops on Intel, but 2 eliminated uops on Ryzen. It's interesting to guess whether Intel could have implemented it more efficiently.)


BTW, as a workaround for an erratum on Haswell, Linux doesn't provide uops_executed.thread when hyperthreading is enabled, only uops_executed.core. The other core was definitely idle the whole time, not even timer interrupts, because I took it offline (via /sys/devices/system/cpu/cpu*/online). Unfortunately that can't be done before perf decides that HT is enabled, and my Dell laptop doesn't have a BIOS option to disable HT. So I can't use all 8 hardware PMU counters at once on that system, only 4. :/

  • +1 Great answer! Some of it actually went over my head (e.g. I'd never heard of the "fused domain" before), but I think I've got a handle on what's happening. Thanks! (2 upvotes)
  • Yes, I'm pretty sure I understand it. You're saying dec + jnz get fused into 1 uop, so with the mov removed you have 2 uops for 4 instructions, each taking one cycle, giving 2.00 insn/cycle, and analogously for the 1.33 and 1.50 cases. The 2% is definitely curious, I agree. But this is a really good answer; I'll accept it at some point, I'm just not in a hurry. Thanks for writing it. (2 upvotes)
  • @JDługosz: `movzx eax, bl` is 8 to 64. The 32 -> 64 part is implicit in writing a 32-bit register (/sf/ask/782399621/ : instructions that write a 32-bit register zero the upper half of the full 64-bit register). Writing `movzx rax, bl` would make the code bigger (REX prefix) with no benefit. (2 upvotes)
  • @BeeOnRope: Oh, FFS Intel, test your CPUs better so we don't have to keep working around the performance potholes introduced by mitigations. Especially since Intel's optimization advice for IvyBridge was to overwrite the result of a `mov` right away to free up mov-elimination resources, making it more likely that a `mov` sits on the critical path without being eliminated. (Compilers seem to prefer doing more operations on the copy rather than the original, once a copy has been made.) (2 upvotes)

har*_*old 11

Here are two small tests that I believe conclusively show evidence for mov-elimination:

__loop1:
    add edx, 1
    add edx, 1
    add ecx, 1
    jnc __loop1

__loop2:
    mov eax, edx
    add eax, 1
    mov edx, eax
    add edx, 1
    add ecx, 1
    jnc __loop2

If mov added a cycle to the dependency chain, the second version would be expected to take about 4 cycles per iteration. On my Haswell, it takes about 2 cycles per iteration, which can't happen without mov-elimination.
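
Spelling out that expectation against __loop2's critical path:

    ; if mov had 1c latency:           with mov-elimination (0c):
    ;   mov eax, edx   1c                mov eax, edx   0c
    ;   add eax, 1     1c                add eax, 1     1c
    ;   mov edx, eax   1c                mov edx, eax   0c
    ;   add edx, 1     1c                add edx, 1     1c
    ; -> ~4c per iteration             -> ~2c per iteration (what's measured)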

  • @Mehrdad Because the `mov`s are now part of the dependency chain, so if they had latency, it would have to add up. In your test case the `mov` just dangles at the end of the chain; nothing has to wait for it to happen. It might be eliminated or not, there's no way to tell. (4 upvotes)
  • @Mehrdad The timings differ, yes. But latency can only ever be (inb4 Netburst and its weird double-pumped ALUs) an integer number of cycles, so the `mov` either adds a cycle or it doesn't (in which case it must have been eliminated). That its mere presence has *other* (more subtle) effects is a separate matter. You're absolutely right that those effects exist, of course. (3 upvotes)
  • @Mehrdad Oh sorry, yes, the average latency can be non-integer. Under the hypothesis that what's happening is an *occasional* failure to eliminate the mov, you could even say the latency averages out to some low but non-zero number. AFAIK it's just due to other effects, but it's always worth a try. E.g.: if the consistent small penalty in my second example changed significantly when "other harmless junk" were put there instead of the movs, that might point at something interesting in that direction. (2 upvotes)
