iBu*_*Bug asked:  [tags: optimization, performance, assembly, loops, micro-optimization]
When trying to understand assembly (with compiler optimization enabled), I see this behavior. A very basic loop like this:
outside_loop;
while (condition) {
    statements;
}
often gets compiled into (pseudocode):
    ; outside_loop
    jmp loop_condition      ; unconditional
loop_start:
    loop_statements
loop_condition:
    condition_check
    jmp_if_true loop_start
    ; outside_loop
But if optimization is not turned on, it compiles into more naturally understandable code:
loop_condition:
    condition_check
    jmp_if_false loop_end
    loop_statements
    jmp loop_condition      ; unconditional
loop_end:
To my understanding, the optimized compiled code is more like:
goto condition;
do {
    statements;
condition: ;
} while (condition_check);
I can't see a huge performance gain or a gain in code readability, so why is this done so often? Is there a name for this loop style, e.g. "trailing condition check"?
Pet*_*des answered:
Fewer instructions/uops inside the loop = better. Structuring the code outside the loop to achieve this is very often a good idea, whether it's branches to allow a better loop structure, or partially peeling the first iteration so you can fall into the loop at a convenient point and maybe save a while(1) {... ; if(x)break; ...; } instruction or whatever inside it (not sure if there's a name for that; "software pipelining" is something specific).
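As an aside, here's a toy sketch of that first-iteration peeling (my example, not required by anything above): printing a comma-separated list normally needs an "am I first?" branch every iteration; peeling the first iteration removes it from the loop body.

#include <stdio.h>

/* Illustrative peeling example: the per-iteration "do I need a
   separator?" branch disappears from the loop body. */
void print_csv(const int *a, size_t n)
{
    if (n == 0)
        return;
    printf("%d", a[0]);            /* peeled first iteration: no leading comma */
    for (size_t i = 1; i < n; i++)
        printf(",%d", a[i]);       /* body is now branch-free apart from the loop itself */
}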
do{}while() is the canonical/idiomatic structure for loops in asm on all architectures; get used to it. IDK if there's a name for it; I'd say such loops have a "do structure". If you want names, you could call the while() structure "crappy unoptimized code" or "written by a novice". :P A loop-branch at the bottom is universal, and not even worth mentioning as a Loop Optimization. You always do that.
This pattern is so widely used that on CPUs which use static branch prediction for branches without an entry in the branch-predictor caches, unknown forward conditional branches are predicted not-taken, and unknown backward branches are predicted taken (because they're probably loop branches). See static branch prediction on the newer Intel processors on Matt Godbolt's blog, and Agner Fog's branch-prediction chapter at the start of his microarch PDF.
This answer ended up using x86 examples for everything, but much of this applies across the board for all architectures. I wouldn't be surprised if other superscalar/out-of-order implementations (like some ARM, or POWER) also have limited branch-instruction throughput whether they're taken or not. But fewer instructions inside the loop is nearly universal when all you have is a conditional branch at the bottom, and no unconditional branch.
If the loop might need to run zero times, compilers more often put a test-and-branch outside the loop to skip it, instead of jumping to the loop condition at the bottom. (i.e. if the compiler can't prove the loop condition is always true on the first iteration).
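A minimal C sketch of that shape (hypothetical function; this mirrors the optimized asm at the top of the question):

void zero_words(int *p, unsigned n)
{
    /* source was:  while (n) { *p++ = 0; n--; }
       compilers typically emit the shape below instead: */
    if (n != 0) {            /* test-and-branch outside skips the whole loop */
        do {
            *p++ = 0;
        } while (--n);       /* single conditional branch at the bottom */
    }
}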
BTW, this paper calls transforming while() to if(){ do{}while; } an "inversion", but loop inversion usually means inverting a nested loop. (e.g. if the source loops over a row-major multi-dimensional array in the wrong order, a clever compiler could change for(i) for(j) a[j][i]++; into for(j) for(i) a[j][i]++; if it can prove it's correct.) But I guess you can look at the if() as a zero-or-one iteration loop. Fun fact: compiler devs teaching their compilers how to invert a loop (to allow auto-vectorization) for a (very) specific case is why SPECint2006's libquantum benchmark is "broken". Most compilers can't invert loops in the general case, just ones that look almost exactly like the one in SPECint2006...
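Here's a toy version of that nested-loop interchange (array dimensions made up for the example):

#define ROWS 1024
#define COLS 1024

void bump_all(int a[ROWS][COLS])
{
    /* source order for(i) for(j) a[j][i]++ strides COLS ints per access;
       the interchanged order below walks memory contiguously: */
    for (int j = 0; j < ROWS; j++)
        for (int i = 0; i < COLS; i++)
            a[j][i]++;
}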
You can help the compiler make more compact asm (fewer instructions outside the loop) by writing loops in C with do{}while() when you know the caller isn't allowed to pass size=0, or whatever else guarantees the loop runs at least once.

(Actually 0 or negative, for signed loop bounds. Signed vs. unsigned loop counters is a tricky optimization issue, especially if you pick a type narrower than pointers; check the compiler's asm output to make sure it isn't sign-extending a narrow loop counter inside the loop if you use it as an array index. But note that signed can actually be helpful, because the compiler can assume that i++ <= bound will eventually become false, since signed overflow is UB but unsigned overflow isn't. So with unsigned, while(i++ <= bound) is infinite if bound = UINT_MAX.) I don't have a blanket recommendation for when to use signed vs. unsigned; size_t is often a good choice for looping over arrays, though, if you want to avoid x86-64 REX prefixes in the loop overhead (for a trivial saving in code size). But convincing the compiler not to waste any instructions zero- or sign-extending can be tricky.
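A sketch of what that looks like in practice (hypothetical function; the caller is assumed to guarantee n != 0):

#include <stddef.h>

int sum_nonempty(const int *a, size_t n)   /* contract: n >= 1 */
{
    int total = 0;
    size_t i = 0;
    do {                      /* no entry check: we promised n != 0 */
        total += a[i];
        i++;
    } while (i < n);          /* one macro-fusable cmp/branch at the bottom */
    return total;
}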
"I can't see a huge performance gain"
Here's an example where this optimization gives a 2x speedup on Intel CPUs before Haswell, because P6 and SnB/IvB can only run branches on port 5, including not-taken conditional branches.
Required background for this kind of static performance analysis: Agner Fog's microarch guide (read the Sandybridge section). Also read his Optimizing Assembly guide; it's excellent (but occasionally out of date in places). See also the other x86 performance links in the x86 tag wiki. See also Can x86's MOV really be "free"? Why can't I reproduce this at all? for some static analysis backed up by experiments with perf counters, and an explanation of fused vs. unfused domain uops.
You can also use Intel's IACA software (Intel Architecture Code Analyzer) to do static analysis of these loops.
; sum(int []) using SSE2 PADDD (dword elements)
; edi = pointer, esi = end_pointer.
; scalar cleanup / unaligned handling / horizontal sum of XMM0 not shown.
; NASM syntax
ALIGN 16                    ; not required for max performance for tiny loops on most CPUs
.looptop:                   ; while (edi < end_pointer) {
    cmp     edi, esi        ; 32-bit code so this can macro-fuse on Core2
    jae     .done           ; 1 uop, port5 only (macro-fused with cmp)
    paddd   xmm0, [edi]     ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16         ; 1 uop, p015
    jmp     .looptop        ; 1 uop, p5 only
    ; Sandybridge/Ivybridge ports each uop can use
.done:                      ; }
This is 4 fused-domain uops (with macro-fusion of cmp/jae), so it can issue from the front-end into the out-of-order core at one iteration per clock. But in the unfused domain there are 4 ALU uops, and Intel pre-Haswell has only 3 ALU ports.
More importantly, port5 pressure is the bottleneck: this loop can execute at only one iteration per 2 cycles, because cmp/jae and jmp both need to run on port5. Other uops stealing port5 could reduce practical throughput below that.
Writing the loop idiomatically for asm, we get:
ALIGN 16
.looptop:                   ; do {
    paddd   xmm0, [edi]     ; 1 micro-fused uop, p1/p5 + a load port
    add     edi, 16         ; 1 uop, p015
    cmp     edi, esi        ; macro-fuses with jb into 1 uop, port5 only
    jb      .looptop        ; } while (edi < end_pointer);
Notice right away, independent of everything else, that this is one fewer instruction in the loop. This loop structure is at least slightly better on everything from simple non-pipelined 8086 through classic RISC (like early MIPS), especially for long-running loops (assuming they don't bottleneck on memory bandwidth).
Core2 and later should run this at one iteration per clock, twice as fast as the while(){}-structured loop, if memory isn't a bottleneck (i.e. assuming L1D hits, or at least L2 actually; this is only SSE2 16-bytes per clock).
This is only 3 fused-domain uops, so can issue at better than one per clock on anything since Core2, or just one per clock if issue groups always end with a taken branch.
But importantly, port5 pressure is vastly reduced: only cmp/jb needs it. Other uops will probably be scheduled to port5 some of the time and steal cycles from loop-branch throughput, but this will be a few percent instead of a factor of 2. See How are x86 uops scheduled, exactly?.
Most CPUs that normally have a taken-branch throughput of one per 2 cycles can still execute tiny loops at 1 per clock. There are some exceptions, though. (I forget which CPUs can't run tight loops at 1 per clock; maybe Bulldozer-family? Or maybe just some low-power CPUs like VIA Nano.) Sandybridge and Core2 can definitely run tight loops at one per clock. They even have loop buffers; Core2's comes after instruction-length decode but before regular decode. Nehalem and later recycle uops in the queue that feeds the issue/rename stage. (Except for Skylake with updated microcode; Intel had to disable the loop buffer because of a partial-register merging bug.)
However, there is a loop-carried dependency chain on xmm0: Intel CPUs have 1-cycle latency paddd, so we're right up against that bottleneck, too. add edi, 16 is also 1 cycle latency. On Bulldozer-family, even integer vector ops have 2c latency, so that would bottleneck the loop at 2c per iteration. (AMD since K8 and Intel since SnB can run two loads per clock, so we need to unroll anyway for max throughput.) With floating point, you definitely want to unroll with multiple accumulators. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?.
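A C sketch of that multiple-accumulator unrolling for an FP reduction (my toy example; n is assumed to be a multiple of 4 to keep it short):

#include <stddef.h>

float dot4(const float *a, const float *b, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   /* 4 independent dep chains */
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    return (s0 + s1) + (s2 + s3);           /* combine outside the loop */
}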
If I'd used an indexed addressing mode, like paddd xmm0, [edi + eax], I could have used sub eax, 16/jnc at the loop condition. SUB/JNC can macro-fuse on Sandybridge-family, but the indexed load would un-laminate on SnB/IvB (but stay fused on Haswell and later, unless you use the AVX form).
; index relative to the end of the array, with an index counting up towards zero
    add     rdi, rsi        ; rdi = end_pointer
    xor     eax, eax
    sub     rax, rsi        ; rax = -length, so [rdi+rax] = first element
.looptop:                   ; do {
    paddd   xmm0, [rdi + rax]
    add     rax, 16
    jl      .looptop        ; } while (idx += 16 < 0);  // or JNC still works
(It's usually better to unroll some to hide the overhead of pointer increments instead of using indexed addressing modes, especially for stores, partly because indexed stores can't use the port7 store AGU on Haswell+.)
On Core2/Nehalem add/jl don't macro-fuse, so this is 3 fused-domain uops even in 64-bit mode, without depending on macro-fusion. Same for AMD K8/K10/Bulldozer-family/Ryzen: no fusion of the loop condition, but PADDD with a memory operand is 1 m-op/uop.
On SnB, paddd un-laminates from the load, but add/jl macro-fuse, so again 3 fused-domain uops. (But in the unfused domain, only 2 ALU uops + 1 load, so probably fewer resource conflicts reducing throughput of the loop.)
On HSW and later, this is 2 fused-domain uops because an indexed load can stay micro-fused with PADDD, and add/jl macro-fuses. (Predicted-taken branches run on port 6, so there are never resource conflicts.)
Of course, the loops can only run at best 1 iteration per clock because of taken branch throughput limits even for tiny loops. This indexing trick is potentially useful if you had something else to do inside the loop, too.
Yes, that exaggerates the effect of loop overhead. But gcc doesn't unroll by default even at -O3 (unless it decides to fully unroll). It only unrolls with profile-guided optimization to let it know which loops are hot. (-fprofile-use). You can enable -funroll-all-loops, but I'd only recommend doing that on a per-file basis for a compilation unit you know has one of your hot loops that needs it. Or maybe even on a per-function basis with an __attribute__, if there is one for optimization options like that.
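If I remember right, GCC does have knobs at roughly that granularity; a hedged sketch (verify both against your GCC version's docs — the pragma is GCC 8+):

/* Per-function / per-loop unrolling requests in GCC (check your version). */
__attribute__((optimize("unroll-loops")))
void scale_hot(int *a, unsigned n)
{
    #pragma GCC unroll 4          /* ask for 4x unrolling of this loop */
    for (unsigned i = 0; i < n; i++)
        a[i] *= 2;
}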
So this is highly relevant for compiler-generated code. (But clang does default to unrolling tiny loops by 4, or small loops by 2, and extremely importantly, using multiple accumulators to hide latency.)
Consider what happens when the loop body should run once or twice: There's a lot more jumping with anything other than do{}while.
For do{}while, execution is a straight-line with no taken branches and one not-taken branch at the bottom. This is excellent.
For an if() { do{}while; } that might run the loop zero times, it's two not-taken branches. That's still very good. (Not-taken is slightly cheaper for the front-end than taken when both are correctly predicted).
For a jmp-to-the-bottom jmp; do{}while(), it's one taken unconditional branch, one taken loop condition, and then the loop branch is not-taken. This is kinda clunky but modern branch predictors are very good...
For a while(){} structure, this is one not-taken loop exit, one taken jmp at the bottom, then one taken loop-exit branch at the top.
With more iterations, each loop structure does one more taken branch. while(){} also does one more not-taken branch per iteration, so it quickly becomes obviously worse.
The latter two loop structures have more jumping around for small trip counts.
Jumping to the bottom also has a disadvantage for non-tiny loops that the bottom of the loop might be cold in L1I cache if it hasn't run for a while. Code fetch/prefetch is good at bringing code to the front-end in a straight line, but if prediction didn't predict the branch early enough, you might have a code miss for the jump to the bottom. Also, parallel decode will probably have (or could have) decoded some of the top of the loop while decoding the jmp to the bottom.
Conditionally jumping over a do{}while loop avoids all that: you only jump forwards into code that hasn't been run yet in cases where the code you're jumping over shouldn't run at all. It often predicts very well because a lot of code never actually takes 0 trips through the loop. (i.e. it could have been a do{}while, the compiler just didn't manage to prove it.)
Jumping to the bottom also means the core can't start working on the real loop body until after the front-end chases two taken branches.
There are cases with complicated loop conditions where it's easiest to write it this way, and the performance impact is small, but compilers often avoid it.
Consider a memchr loop, or a strchr loop: they have to stop at the end of the buffer (based on a count) or the end of an implicit-length string (0 byte). But they also have to break out of the loop if they find a match before the end.
So you'll often see a structure like
do {
    if () break;
    blah blah;
} while(condition);
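For instance, a memchr-flavored toy version of that structure (my illustration, not glibc's actual implementation; it assumes p < end on entry, do{}while-style):

const char *find_byte(const char *p, const char *end, char c)
{
    do {
        if (*p == c)
            return p;         /* early-out branch: match found */
    } while (++p < end);      /* normal backwards taken loop branch */
    return 0;                 /* hit the end of the buffer, no match */
}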
Or just two conditions near the bottom. Ideally you can test multiple logical conditions with the same actual instruction (e.g. 5 < x && x < 25 using sub eax, 5 / cmp eax, 20 / ja .outside_range, the unsigned-compare trick for range checking, or combine that with an OR to check for alphabetic characters of either case in 4 instructions), but sometimes you can't, and just need to use an if()break style loop-exit branch as well as a normal backwards taken branch.
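In C, that unsigned-compare range trick looks like this (toy example; compilers will often do this transformation for you):

/* 5 < x && x < 25  <=>  x in [6, 24]  <=>  (unsigned)(x - 6) <= 18,
   which compiles to a single sub / cmp / ja sequence. */
int in_range(int x)
{
    return (unsigned)(x - 6) <= 24u - 6u;
}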
Matt Godbolt's CppCon2017 talk: "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" for good ways to look at compiler output (e.g. what kind of inputs give interesting output, and a primer on reading x86 asm for beginners). related: How to remove "noise" from GCC/clang assembly output?
Modern Microprocessors A 90-Minute Guide!. A detailed look at superscalar pipelined CPUs, mostly architecture neutral. Very good. Explains instruction-level parallelism and stuff like that.
Other links in the x86 tag wiki, including Intel's optimization manuals. Also several of my answers (linked in the tag wiki) have things that Agner missed in his testing on more recent microarchitectures (like un-lamination of micro-fused indexed addressing modes on SnB, and partial register stuff on Haswell+).
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?: how to use multiple accumulators to hide latency of a reduction loop (like an FP dot product).
Lecture 7: Loop Transformations (also on archive.org). Lots of cool stuff that compilers do to loops, using C syntax to describe the asm.
Sort of off topic:
Memory bandwidth is almost always important, but it's not widely known that a single core on most modern x86 CPUs can't saturate DRAM, and not even close on many-core Xeons, where single-threaded bandwidth is worse than on a quad-core with dual-channel memory controllers.
What Every Programmer Should Know About Memory? (My answer has commentary on what's changed and what's still relevant in Ulrich Drepper's famous excellent article.)