INC instruction vs ADD 1: Does it matter?

Gil*_*esz 26 performance x86 assembly increment micro-optimization

From Ira Baxter's answer on "Why do the INC and DEC instructions not affect the Carry Flag (CF)?":

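Mostly, I stay away from INC and DEC now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and ADD/SUB don't. So where it doesn't matter (most places), I use ADD/SUB to avoid the stalls. I use INC/DEC only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.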

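I would like to ask why it can cause stalls in the pipeline while ADD doesn't. After all, both ADD and INC update the flags register. The only difference is that INC doesn't update CF. But why does that matter?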

Pet*_*des 49

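On modern CPUs, add is never slower than inc (except for indirect code-size/decode effects), but usually it's not faster either, so you should prefer inc for code-size reasons. Especially if this choice is repeated many times in the same binary (e.g. if you are a compiler writer).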

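inc saves 1 byte (64-bit mode), or 2 bytes (opcodes 0x40..F, the inc r32 / dec r32 short form in 32-bit mode, re-purposed as the REX prefix for x86-64). This makes a small percentage difference in total code size. This helps instruction-cache hit rates, iTLB hit rate, and the number of pages that have to be loaded from disk.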
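
To make the size difference concrete, here is a small NASM listing. The byte values in the comments are the standard encodings as far as I know them, assuming NASM's default optimization picks the short imm8 forms; it's worth confirming against your own assembler's listing output. The file name is arbitrary.

```nasm
; Assemble with: nasm -f bin -l sizes.lst sizes.asm
bits 64
    inc  eax              ; FF C0        2 bytes
    add  eax, 1           ; 83 C0 01     3 bytes
    lea  eax, [rax+1]     ; 8D 40 01     3 bytes, does not touch FLAGS
    inc  rax              ; 48 FF C0     3 bytes (REX.W prefix)
    add  rax, 1           ; 48 83 C0 01  4 bytes

bits 32
    inc  eax              ; 40           1 byte (short form; this byte is a REX prefix in 64-bit mode)
    add  eax, 1           ; 83 C0 01     3 bytes
```

So the saving is 1 byte per instruction in 64-bit code and 2 bytes in 32-bit code, which only adds up when the choice is repeated across a whole binary.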

Advantages of inc:

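  • Code size, directly.
  • Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of add. (See Agner Fog's Table 9.1 in the Sandybridge section of his microarchitecture guide.) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache, and uop-cache read bandwidth effects.
  • Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after inc without a stall. (Not on Nehalem and earlier.)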

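There is one exception among modern CPUs: Silvermont/Goldmont/Knight's Landing decodes inc/dec efficiently as 1 uop, but expands it to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. inc throughput is only 1 per clock, vs. 0.5c (or 0.33c on Goldmont) for independent add r32, imm8, because of the dep chain created by the flag-merging uops.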

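Unlike P4, the register result doesn't have a false dependency on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the OOO window is much smaller than on mainstream CPUs like Haswell or Ryzen.) Running inc as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.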

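SMont/KNL has a queue between decode and allocate/rename (see Intel's optimization manual, figure 16-2), so expanding to 2 uops during issue can fill bubbles from decode stalls (on instructions like one-operand mul, or pshufb, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode). Or on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.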

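inc/dec aren't the only instructions that decode as 1 uop but issue as 2: push/pop, call/ret, and lea with 3 components do this too. So do KNL's AVX512 gather instructions. Source: Intel's optimization manual, 17.1.2 Out-of-Order Engine (KNL). It's only a small throughput penalty (and sometimes not even that, if anything else is a bigger bottleneck), so it's generally fine to still use inc for "generic" tuning.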


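Intel's optimization manual still recommends add 1 over inc in general, to avoid the risk of partial-flag stalls. But since Intel's own compiler doesn't do that by default, it's not too likely that future CPUs will make inc slow in all cases, like P4 did.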

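Clang 5.0 and Intel's ICC 17 (on Godbolt) do use inc when optimizing for speed (-O3), not just for size. -mtune=pentium4 makes them avoid inc/dec, but the default -mtune=generic doesn't put much weight on P4.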

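ICC17 -xMIC-AVX512 (equivalent to gcc's -march=knl) does avoid inc, which is probably a good bet in general for Silvermont/KNL. But it's not usually a performance disaster to use inc, so it's probably still appropriate for "generic" tuning to use inc/dec in most code, especially when the flag result isn't part of the critical path.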


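Other than Silvermont, this is mostly stale optimization advice left over from Pentium4. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last instruction that wrote any flags, e.g. in BigInteger adc loops. (And in that case, you need to preserve CF, so using add would break your code.)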

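add writes all the condition-flag bits in the EFLAGS register. Register renaming makes write-only updates easy for out-of-order execution: see write-after-write and write-after-read hazards. add eax, 1 and add ecx, 1 can execute in parallel because they are fully independent of each other. (Even Pentium4 renames the condition-flag bits separately from the rest of EFLAGS, since even add leaves the interrupt-enable bit and many other bits unmodified.)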

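On P4, inc and dec depend on the previous value of all the flags, so they can't execute in parallel with each other or with preceding flag-setting instructions. (e.g. add eax, [mem] / inc ecx makes the inc wait until after the add, even if the add's load misses in cache.) This is called a false dependency. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.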
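
As a minimal sketch of that add / inc pair (the label and the use of rdi as the pointer argument are assumptions for illustration, and the comments restate the behaviour described in this answer rather than anything measured here):

```nasm
bits 64
section .text
global false_dep_demo

; rdi = pointer to a dword (hypothetical argument)
false_dep_demo:
    add  eax, [rdi]       ; writes all flags; the load may miss in cache
    inc  ecx              ; writes every flag except CF
                          ;   P4: must wait for the add's flag result (false dependency)
                          ;   CPUs that rename CF separately: can run without waiting for the add
    add  edx, 1           ; write-only flag update: needs no earlier flag value
    ret
```

Nothing in this sequence ever reads a flag, which is why the wait on P4 counts as a false dependency rather than a real one.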

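All other out-of-order x86 CPUs (including AMD's) rename different parts of the flags separately, so internally they do a write-only update to all the flags except CF. (Source: Agner Fog's microarchitecture guide.) Only a few instructions, like adc or cmc, truly read and then write flags. But shl r, cl does too (see below).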


Cases where add dest, 1 is preferable to inc dest, at least for the Intel P6/SnB uarch families:

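  • Memory destination: add [rdi], 1 can micro-fuse the store and the load+add on Intel Core2 and SnB-family, so it's 2 fused-domain uops / 4 unfused-domain uops.
    inc [rdi] can only micro-fuse the store, so it's 3F/4U.
    According to Agner Fog's tables, AMD and Silvermont run memory-destination inc and add the same, as a single macro-op / uop.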

    But beware of uop-cache effects with add [label], 1, which needs a 32-bit address and an 8-bit immediate for the same uop.

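  • Before a variable-count shift/rotate, to break the dependency on flags and avoid partial-flag merging: shl reg, cl has an input dependency on the flags because of unfortunate CISC history: it has to leave them unmodified if the shift count is 0.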

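    On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads reg and cl, and writes reg. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only being able to achieve max throughput if mixed with instructions that break dependencies on flags. (I posted more about this on Agner Fog's forum.) Use BMI2 shlx when possible; it's 1 uop and the count can be in any register.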

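    Anyway, inc (writing flags but leaving CF unmodified) before a variable-count shl leaves it with a false dependency on whatever wrote CF last, and on SnB/IvB can require an extra uop to merge flags.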

    Core2/Nehalem manage to avoid even the false dep on flags: Merom runs a loop of 6 independent shl reg,cl instructions at nearly two shifts per clock, same performance with cl=0 or cl=13. Anything better than 1 per clock proves there's no input dependency on flags.

    I tried loops with shl edx, 2 and shl edx, 0 (immediate-count shifts), but didn't see a speed difference between dec and sub on Core2, HSW, or SKL. I don't know about AMD.
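
The variable-count shift case above can be sketched like this. The labels and the register assignment (edi = value, esi = some counter, ecx = shift count) are assumptions for illustration; the comments restate the flag behaviour described above rather than measured results.

```nasm
bits 64
section .text
global shift_after_inc, shift_after_add, shift_bmi2

shift_after_inc:
    mov  eax, edi
    inc  esi              ; writes flags but leaves CF unmodified
    shl  eax, cl          ; flag input comes from inc plus whatever wrote CF last
    ret

shift_after_add:
    mov  eax, edi
    add  esi, 1           ; writes all flags, so older flag writers no longer matter
    shl  eax, cl
    ret

shift_bmi2:
    shlx eax, edi, ecx    ; BMI2: count in any register, FLAGS neither read nor written
    ret
```

shlx needs a CPU with BMI2, so it is only an option when you can assume or dispatch on that feature.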

Update: The nice shift performance on Intel P6-family comes at the cost of a large performance pothole which you need to avoid: when an instruction depends on the flag result of a shift instruction, the front end stalls until the shift instruction is retired. (Source: Intel's optimization manual, Section 3.5.2.6: Partial Flag Register Stalls.) So shr eax, 2 / jnz is pretty catastrophic for performance on Intel pre-Sandybridge, I guess! Use shr eax, 2 / test eax,eax / jnz if you care about Nehalem and earlier. Intel's examples make it clear this applies to immediate-count shifts, not just count=cl.

In processors based on Intel Core microarchitecture [this means Core 2 and later], shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.

Intel actually means the special opcode with no immediate, which shifts by an implicit 1. I think there is a performance difference between the two ways of encoding shr eax,1, with the short encoding (using the original 8086 opcode D1 /5) producing a write-only (partial) flag result, but the longer encoding (C1 /5, imm8 with an immediate 1) not having its immediate checked for 0 until execution time, but without tracking the flag output in the out-of-order machinery.

Since looping over bits is common, but looping over every 2nd bit (or any other stride) is very uncommon, this seems like a reasonable design choice. This explains why compilers like to test the result of a shift instead of directly using flag results from shr.
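
A minimal sketch of the shift-then-test pattern, using a made-up function that counts how many 2-bit steps it takes to shift a value down to zero (the label and the edi input are assumptions for illustration):

```nasm
bits 64
section .text
global count_shifts

; edi = input value; returns the step count in eax
count_shifts:
    xor  eax, eax
    test edi, edi
    jz   .done            ; zero input: zero steps
.next:
    add  eax, 1
    shr  edi, 2
    test edi, edi         ; full flag write from a non-shift instruction
    jnz  .next            ; branches on test's ZF, not on the shift's flag result
.done:
    ret
```

The test costs one extra instruction per iteration, so it is a trade-off that mainly pays off when Nehalem and earlier are among your targets.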

Update: for variable count shifts on SnB-family, Intel's optimization manual says:

3.5.1.6 Variable Bit Count Rotation and Shift

In Intel microarchitecture code name Sandy Bridge, The "ROL/ROR/SHL/SHR reg, cl" instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline, experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay.

Consider the looped sequence below:

loop:
   shl eax, cl
   add ebx, eax
   dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
   jnz loop

The DEC instruction does not modify the carry flag. Consequently, the SHL EAX, CL instruction needs to execute the three micro-ops flow in subsequent iterations. The SUB instruction will update all flags. So replacing DEC with SUB will allow SHL EAX, CL to execute the two micro-ops flow.
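
Applying the manual's suggestion gives something like the following sketch. The function label is hypothetical, and the register roles are kept from Intel's example, so ebx is clobbered here even though standard calling conventions treat it as call-preserved.

```nasm
bits 64
section .text
global shl_accumulate

; eax = value, ebx = accumulator, cl = shift count, edx = iteration count (> 0)
shl_accumulate:
.top:
    shl  eax, cl
    add  ebx, eax
    sub  edx, 1           ; unlike dec, writes all flags including CF
    jnz  .top
    ret
```

The local label .top replaces Intel's loop: label, because loop is also an instruction mnemonic and NASM will not accept it as a plain label.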


Terminology

Partial-flag stalls happen when flags are read, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead.

Several answers/comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag stall" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

(Note that partial-register penalties are not always the same as partial-flags, see below).

### Partial flag stall on Intel P6-family CPUs:
bigint_loop:
    adc   eax, [array_end + rcx*4]   # partial-flag stall when adc reads CF 
    inc   rcx                        # rcx counts up from negative values towards zero
    # test rcx,rcx  # eliminate partial-flag stalls by writing all flags, or better use add rcx,1
    jnz   bigint_loop
# this loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
# Note that `test` will change the input to the next adc, and so would replacing inc with add 1

Other cases are fine, e.g. a partial-flag write followed by a full flag write, or a read of only the flags written by inc. On SnB-family CPUs, inc/dec can even macro-fuse with a jcc, the same as add/sub.

After P4, Intel mostly gave up on trying to get people to re-compile with -mtune=pentium4 or modify hand-written asm as much to avoid serious bottlenecks. (Tuning for a specific microarchitecture will always be a thing, but P4 was unusual in deprecating so many things that used to be fast on previous CPUs, and thus were common in existing binaries.) P4 wanted people to use a RISC-like subset of x86, and also had branch-prediction hints as prefixes for JCC instructions. (It also had other serious problems, like the trace cache that just wasn't good enough, and weak decoders that meant bad performance on trace-cache misses. Not to mention that the whole philosophy of clocking very high ran into the power-density wall.)

When Intel abandoned P4 (NetBurst uarch), they returned to P6-family designs (Pentium-M/Core2/Nehalem), which inherited their partial-flag/partial-reg handling from earlier P6-family CPUs (PPro to PIII) that pre-dated the NetBurst mis-step. (Not everything about P4 was inherently bad, and some of the ideas re-appeared in Sandybridge, but overall NetBurst is widely considered a mistake.) Some very-CISC instructions are still slower than the multi-instruction alternatives, e.g. enter, loop, or bt [mem], reg (because the value of reg affects which memory address is used), but these were all slow in older CPUs so compilers already avoided them.

Pentium-M even improved hardware support for partial-regs (lower merging penalties). In Sandybridge, Intel kept partial-flag and partial-reg renaming and made it much more efficient when merging is needed (merging uop inserted with no or minimal stall). SnB made major internal changes and is considered a new uarch family, even though it inherits a lot from Nehalem, and some ideas from P4. (But note that SnB's decoded-uop cache is not a trace cache, though, so it's a very different solution to the decoder throughput/power problem that netburst's trace cache tried to solve.)


For example, inc al and inc ah can run in parallel on P6/SnB-family CPUs, but reading eax afterwards requires merging.
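
A minimal sketch of a sequence that forces such a merge (the label is hypothetical, and the comments only restate the behaviour described in this answer for CPUs that rename the low and high byte separately):

```nasm
bits 64
section .text
global partial_reg_demo

partial_reg_demo:
    inc  al               ; writes only the low byte of eax
    inc  ah               ; writes only bits 8..15 of eax
    mov  edx, eax         ; reads all of eax, so the byte writes above must be merged first
    ret
```

How expensive that last read is depends heavily on the microarchitecture, as the next paragraphs describe.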

PPro/PIII stall for 5-6 cycles when reading the full reg. Core2/Nehalem stall for only 2 or 3 cycles while inserting a merging uop for partial regs, but partial flags are still a longer stall.

SnB inserts a merging uop without stalling, like for flags. Intel's optimization guide says that for merging AH/BH/CH/DH into the wider reg, inserting the merging uop takes an entire issue/rename cycle during which no other uops can be allocated. But for low8/low16, the merging uop is "part of the flow", so it apparently doesn't cause additional front-end throughput penalties beyond taking up one of the 4 slots in an issue/rename cycle.

In IvyBridge (or at least Haswell), Intel dropped partial-register renaming for low8 and low16 registers, keeping it only for high8 registers (AH/BH/CH/DH). Reading high8 registers has extra latency. Also, setcc al has a false dependency on the old value of rax, unlike in Nehalem and earlier (and probably Sandybridge). See this HSW/SKL partial-register performance Q&A for the details.

(I've previously claimed that Haswell could merge AH with no uop, but that's not true and not what Agner Fog's guide says. I skimmed too quickly and unfortunately repeated my wrong understanding in lots of comments and other posts.)

AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so mov al, [mem] has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)


Normally, the only time add instead of inc will make your code faster on AMD or mainstream Intel is when your code actually depends on the doesn't-touch-CF behaviour of inc. i.e. usually add only helps when it would break your code, but note the shl case mentioned above, where the instruction reads flags but usually your code doesn't care about that, so it's a false dependency.

If you do actually want to leave CF unmodified, pre-SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using inc or dec as part of a loop condition when targeting those CPUs, with some unrolling. (For details, see the BigInteger adc Q&A I linked earlier.) It can be useful to use lea to do arithmetic without affecting flags at all, if you don't need to branch on the result.
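
Here is a minimal, non-unrolled sketch of such a loop. The function name and the register roles are assumptions for illustration, and it is not tuned: a real implementation would unroll and probably avoid the memory-destination adc.

```nasm
bits 64
section .text
global bigint_add

; Hypothetical signature: rdi = dst limbs, rsi = src limbs, rcx = limb count (> 0)
; Computes dst += src over rcx 64-bit limbs, returns the carry-out in eax.
bigint_add:
    clc                       ; start with CF = 0
.limb:
    mov   rax, [rsi]
    adc   [rdi], rax          ; consumes and produces CF
    lea   rsi, [rsi+8]        ; pointer updates via lea: no flags touched
    lea   rdi, [rdi+8]
    dec   rcx                 ; sets ZF for the branch, leaves CF alone
    jnz   .limb
    setc  al                  ; carry out of the last limb
    movzx eax, al
    ret
```

Replacing dec rcx with sub rcx, 1 here would overwrite the carry between limbs and silently break the addition, which is the "only helps when it would break your code" situation described above.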

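  • It's also interesting that variable-count shifts used to be 1 uop and single-cycle, back on Core2. That seems impossible, since Intel normally had a 2-operands-per-uop rule, so I wonder how it worked... and why they killed it. (3 upvotes)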

Nay*_*uki 5

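Depending on the CPU's implementation of the instructions, a partial register update may cause a stall. According to Agner Fog's optimization guide, page 62: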


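For historical reasons, the INC and DEC instructions leave the carry flag unchanged, while the other arithmetic flags are written to. This causes a false dependence on the previous value of the flags and costs an extra μop. To avoid these problems, it is recommended that you always use ADD and SUB instead of INC and DEC. For example, INC EAX should be replaced by ADD EAX,1.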


See also page 83 on "Partial flags stall" and page 100 on "Partial flags stall".


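  • That's from the Pentium4 chapter. P4 tried to get all software to change to "add r32, 1" instead of "inc", rather than implementing hardware to rename different flag bits separately like P6 (PPro/PIII) did. It's not relevant for code that won't run on P4, because other CPUs do handle it in hardware. (6 upvotes)
  • Also, this is a false dependency. P4 doesn't have partial-flag stalls, because it never needs to merge changes to different parts. Instead, every instruction that modifies part of the flags depends on the old flags. (4 upvotes)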
