Bee*_*ope 20 performance x86 assembly cpu-architecture micro-optimization
I'd like to know how loops of various sizes perform on recent x86 processors, as a function of the number of uops.

Here's a quote from Peter Cordes, who raised the issue of non-multiple-of-4 uop counts in another question:

I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer.

The issue is whether loops need to be a multiple of N uops to execute at maximum uop throughput, where N is the width of the processor (i.e., 4 for recent Intel processors). There are a lot of complicating factors when talking about "width" and counting uops, but I mostly want to ignore those. In particular, assume no micro- or macro-fusion.
Peter gives the following example of a loop with 7 uops in its body:

A 7-uop loop will issue groups of 4|3|4|3|... I haven't tested larger loops (that don't fit in the loop buffer) to see if it's possible for the first instruction of the next iteration to issue in the same group as the taken branch to it, but I assume not.

More generally, the claim is that each iteration of a loop with x uops in its body will take at least ceil(x/4) cycles, rather than simply x/4.
Is this true for some or all recent x86-compatible processors?
Bee*_*ope 31
I did some investigation on Linux with perf to help answer this on my Skylake i7-6700HQ box, and Haswell results have been kindly provided by another user. The analysis below applies to Skylake, but it is followed by a comparison versus Haswell.

Other architectures may vary0 - I welcome additional results - the source is available.

This question mostly deals with the front end, since on recent architectures it is the front end that imposes the hard limit of four fused-domain uops per cycle.

First, I'll summarize the results in terms of a few "performance rules" to keep in mind when dealing with small loops. There are plenty of other performance rules as well - these are complementary to them (i.e., you probably don't break another rule just to satisfy these). These rules apply most directly to Haswell and later architectures - see the other answer for an overview of the differences on earlier architectures.

First, count the number of macro-fused uops in your loop. You can use Agner Fog's instruction tables to look this up directly for every instruction, except that an ALU uop and an immediately following branch will usually fuse together into a single uop. Then, based on this count:
- If the count is a multiple of 4, you're good: these loops execute optimally out of any uop source.
- If the count is even and less than 32, you're good, except if it's 10, in which case you should unroll to another even number if you can.
- For odd numbers, try to unroll to an even number less than 32, or to a multiple of 4, if you can.
- For loops larger than 32 uops but less than 64, you may want to unroll so the count is a multiple of 4; with more than 64 uops you'll get efficient performance at any count on Skylake and almost all counts on Haswell. The inefficiencies for these loops are still relatively small: the counts to avoid most are 4N + 1, followed by 4N + 2.

For code served out of the uop cache, there are no apparent multiple-of-4 effects. Loops of any number of uops can be executed at a throughput of 4 fused-domain uops per cycle.
For code processed by the legacy decoders, the opposite is true: loop execution time is limited to integral number of cycles, and hence loops that are not a multiple of 4 uops cannot achieve 4 uops/cycle, as they waste some issue/execution slots.
For code issued from the loop stream detector (LSD), the situation is a mix of the two and is explained in more detail below. In general, loops less than 32 uops and with an even number of uops execute optimally, while odd-sized loops do not, and larger loops require a multiple-of-4 uop count to execute optimally.
Intel actually has a note on this in their optimization manual, details in the other answer.
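To make the counting step concrete, here is a hypothetical loop annotated with fused-domain uop counts (the loop is my own example, not from the tests below; the fusion behavior follows the usual Haswell/Skylake rules from Agner Fog's tables):

.top:
mov eax, [rsi]        ; 1 uop (load)
add eax, [rdx]        ; 1 uop (load + add micro-fuse)
mov [rdi], eax        ; 1 uop (store address + store data micro-fuse)
add rsi, 4            ; 1 uop
add rdx, 4            ; 1 uop
add rdi, 4            ; 1 uop
cmp rdi, rbp          ; cmp + jne macro-fuse...
jne .top              ; ...into a single uop

That's 7 fused-domain uops per iteration - an odd, non-multiple-of-4 count that you'd then check against the rules above.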
As anyone well-versed with recent x86 architectures knows, at any point in time the fetch and decode portion of the front end may be working in one of several different modes, depending on the code size and other factors. As it turns out, these different modes all have different behaviors with respect to loop sizing. I'll cover them separately.

The legacy decoder1 is the full machine-code-to-uops decoder that is used2 when the code doesn't fit in the uop caching mechanisms (the LSD or DSB). The primary way this occurs is when the code working set is larger than the uop cache (approximately ~1500 uops in the ideal case, less in practice). For this test, though, we'll take advantage of the fact that the legacy decoder will also be used if an aligned 32-byte chunk contains more than 18 instructions3.

To test the legacy decoder behavior, we use a loop that looks like this:
short_nop:
mov rax, 100_000_000   ; iteration count
ALIGN 32
.top:
dec rax
nop
...                    ; (nop count varied to set the loop size)
jnz .top
ret
Basically, this is a trivial loop that counts rax down to zero. All instructions are a single uop4, and the number of nop instructions is varied (at the location shown as ...) to test different sizes of loops (so a 4-uop loop will have 2 nops, plus the two loop-control instructions). There is no macro-fusion, since we always separate the dec and jnz with at least one nop, and there is no micro-fusion either. Finally, there is no memory access (outside of the implied icache access).
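Generating all these loop-size variants is easy to script with the assembler itself. Here's a minimal sketch using NASM's %rep directive (the macro name and invocation are mine, not from the original test harness):

; expand to a loop with a total of %1 single-uop instructions:
; dec + jnz plus (%1 - 2) one-byte nops
%macro make_loop 1
mov rax, 100_000_000
ALIGN 32
%%top:
dec rax
%rep %1 - 2
nop
%endrep
jnz %%top
ret
%endmacro

seven_uop_loop:
make_loop 7            ; e.g., builds the 7-uop variant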
Note that this loop is very dense - about 1 byte per instruction (since the nop instructions are 1 byte each) - so we'll trigger the more-than-18-instructions-in-a-32B-chunk condition as soon as we hit 19 instructions in the loop. Based on examining the perf performance counters lsd.uops and idq.mite_uops, that's exactly what we see: essentially 100% of the instructions come out of the LSD5 up until and including the 18-uop loop, but at 19 uops and up, 100% come from the legacy decoder.
In any case, here are the cycles/iteration for all loop sizes from 3 to 99 uops6:
The blue points are the loops that fit in the LSD, and show somewhat complex behavior. We'll look at these later.
The red points (starting at 19 uops/iteration), are handled by the legacy decoder, and show a very predictable pattern:
N uops take exactly ceiling(N/4) cycles.

So, for the legacy decoder at least, Peter's observation holds exactly on Skylake: loops with a multiple of 4 uops may execute at an IPC of 4, but any other number of uops will waste 1, 2 or 3 execution slots (for loops with 4N+3, 4N+2, 4N+1 instructions, respectively).
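In formula form (restating the observation above):

$$\text{cycles per iteration} = \left\lceil \frac{N}{4} \right\rceil, \qquad \text{wasted slots per iteration} = 4\left\lceil \frac{N}{4} \right\rceil - N$$

For example, a 7-uop loop takes ceil(7/4) = 2 cycles and wastes one issue slot per iteration.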
It is not clear to me why this happens. It might seem obvious if you consider that decoding happens in contiguous 16B chunks, so at a decoding rate of 4 uops/cycle, a loop that is not a multiple of 4 would always have some trailing (wasted) slots in the cycle in which the jnz instruction is encountered. However, the actual fetch & decode unit is composed of predecode and decode phases, with a queue in between. The predecode phase actually has a throughput of 6 instructions per cycle, but only decodes up to the end of the 16-byte boundary in each cycle. This seems to imply that the bubble that occurs at the end of the loop could be absorbed by the predecoder -> decode queue, since the predecoder has an average throughput higher than 4.
So I can't fully explain this based on my understanding of how the predecoder works. It may be that there is some additional limitation in decoding or pre-decoding that prevents non-integral cycle counts. For example, perhaps the legacy decoders cannot decode instructions on both sides of a jump even if the instructions after the jump are available in the predecoded queue. Perhaps it is related to the need to handle macro-fusion.
The test above shows the behavior where the top of the loop is aligned on a 32-byte boundary. Below is the same graph, but with an added series that shows the effect when the top of the loop is moved up 2 bytes (i.e., now misaligned at a 32N + 30 boundary):

Most loop sizes now suffer a 1 or 2 cycle penalty. The 1-cycle penalty case makes sense when you consider decoding in 16B chunks and 4-instructions-per-cycle decoding, and the 2-cycle penalty cases occur for loops where, for some reason, the DSB is used for 1 instruction in the loop (probably the dec instruction, which appears in its own 32-byte chunk), and some DSB<->MITE switching penalties are incurred.
In some cases, the misalignment doesn't hurt when it ends up better aligning the end of the loop. I tested the misalignment and it persists in the same way up to 200 uop loops. If you take the description of the predecoders at face value, it would seem that, as above, they should be able to hide a fetch bubble for misalignment, but it doesn't happen (perhaps the queue is not big enough).
The uop cache (Intel likes to call it the DSB) is able to cache most loops of a moderate number of instructions. In a typical program, you'd hope that most of your instructions are served out of this cache7.
We can repeat the test above, but now serving uops out of the uop cache. This is a simple matter of increasing the size of our nops to 2 bytes, so we no longer hit the 18-instruction limit. We use the 2-byte nop xchg ax, ax in our loop:
long_nop_test:
mov rax, iters
ALIGN 32
.top:
dec eax
xchg ax, ax ; this is a 2-byte nop
...
xchg ax, ax
jnz .top
ret
Here, the results are very straightforward. For all tested loop sizes delivered out of the DSB, the number of cycles required was N/4 - i.e., the loops executed at the maximum theoretical throughput, even if they didn't have a multiple of 4 uops. So, in general, on Skylake, moderately-sized loops served out of the DSB shouldn't need to worry about ensuring the uop count meets some particular multiple.
Here's a graph out to 1,000 uop loops. If you squint, you can see the sub-optimal behavior before 64-uops (when the loop is in the LSD). After that, it's a straight shot, 4 IPC the whole way to 1,000 uops (with a blip around 900 that was probably due to load on my box):
Next we look at the performance of loops that are small enough to fit in the loop stream detector.
Important note: Intel has apparently disabled the LSD on Skylake (SKL150 erratum) and Kaby Lake (KBL095, KBW095 erratum) chips via a microcode update and on Skylake-X out of the box, due to a bug related to the interaction between hyperthreading and the LSD. For those chips, the graph below will likely not have the interesting region up to 64 uops; rather, it will just look the same as the region after 64 uops.
The loop stream detector can cache small loops of up to 64 uops (on Skylake). In Intel's recent documentation it is positioned more as a power-saving mechanism than a performance feature - although certainly no performance downsides to using the LSD are mentioned.
Running this for the loop sizes that should fit in the LSD, we get the following cycles/iteration behavior:
The red line here is the % of uops which are delivered from the LSD. It flatlines at 100% for all loop sizes from 5 to 56 uops.
For the 3 and 4 uop loops, we have the unusual behavior that 16% and 25% of the uops, respectively, are delivered from the legacy decoder. Huh? Luckily, it doesn't seem to affect the loop throughput as both cases achieve the maximum throughput of 1 loop/cycle - despite the fact that one could expect some MITE<->LSD transition penalties.
Between loop sizes of 57 and 62 uops, the number of uops delivered from LSD exhibits some weird behavior - approximately 70% of the uops are delivered from the LSD, and the rest from the DSB. Skylake nominally has a 64-uop LSD, so this is some kind of transition right before the LSD size is exceeded - perhaps there is some kind of internal alignment within the IDQ (on which the LSD is implemented) that causes only partial hits to the LSD in this phase. This phase is short and, performance-wise, seems mostly to be a linear combination of the full-in-LSD performance which precedes it, and the fully-in-DSB performance which follows it.
Let's look at the main body of results between 5 and 56 uops. We see three distinct regions:
Loops from 3 to 10 uops: Here, the behavior is complex. It is the only region where we see cycle counts that can't be explained by static behavior over a single loop iteration8. The range is short enough that it's hard to say if there is a pattern. Loops of 4, 6 and 8 uops all execute optimally, in N/4 cycles (that's the same pattern as the next region).
A loop of 10 uops, on the other hand, executes in 2.66 cycles per iteration, making it the only even loop size that doesn't execute optimally until you get to loop sizes of 34 uops or above (other than the outlier at 26). That corresponds to something like a repeated uop/cycle execution rate of 4, 4, 4, 3. For a loop of 5 uops, you get 1.33 cycles per iteration, very close but not the same as the ideal of 1.25. That corresponds to an execution rate of 4, 4, 4, 4, 3.
These results are hard to explain. The results are repeatable from run to run, and robust to changes such as swapping out the nop for an instruction that actually does something like mov ecx, 123. It might be something to do with the limit of 1 taken branch every 2 cycles, which applies to all loops except those that are "very small". It might be that the uops occasionally line up such that this limitation kicks in, leading to an extra cycle. Once you get to 12 uops or above, this never occurs since you are always taking at least three cycles per iteration.
Loops from 11 to 32 uops: We see a stair-step pattern, but with a period of two. Basically, all loops with an even number of uops perform optimally - i.e., taking exactly N/4 cycles. Loops with an odd number of uops waste one "issue slot", and take the same number of cycles as a loop with one more uop (i.e., a 17-uop loop takes the same 4.5 cycles as an 18-uop loop; a sketch after this list illustrates this). So here we have behavior better than ceiling(N/4) for many uop counts, and we have the first evidence that Skylake at least can execute loops in a non-integral number of cycles.
The only outliers are N=25 and N=26, which both take about 1.5% longer than expected. It's small but reproducible, and robust to moving the function around in the file. That's too small to be explained by a per-iteration effect, unless it has a giant period, so it's probably something else.
The overall behavior here is exactly consistent (outside of the 25/26 anomaly) with the hardware unrolling the loop by a factor of 2.
Loops from 33 to ~64 uops: We see a stair-step pattern again, but with a period of 4, and worse average performance than the up-to 32 uop case. The behavior is exactly ceiling(N/4) - that is, the same as the legacy decoder case. So for loops of 32 to 64 uops, the LSD provides no apparent benefit over the legacy decoders, in terms of front end throughput for this particular limitation. Of course, there are many other ways the LSD is better - it avoids many of the potential decoding bottlenecks that occur for more complex or longer instructions, and it saves power, etc.
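As a back-of-the-envelope illustration of the even/odd effect in the 11-32 region (the 17-uop loop is hypothetical; the cycle counts follow the measurements above):

; a 17-uop loop: 15 single-uop body instructions + dec + jnz (unfused)
;   as written:  17 uops -> 4.5 cycles/iteration (same cost as 18 uops)
;   unrolled x2: 15*2 + 2 = 32 uops (loop control not duplicated)
;                -> 32/4 = 8 cycles per trip = 4.0 cycles per original
;                   iteration, with the trip count halved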
All of this is quite surprising, because it means that loops delivered from the uop cache generally perform better in the front end than loops delivered from the LSD, despite the LSD usually being positioned as a strictly better source of uops than the DSB (e.g., as part of advice to try to keep loops small enough to fit in the LSD).
Here's another way to look at the same data - in terms of the efficiency loss for a given uop count, versus the theoretical maximum throughput of 4 uops per cycle. A 10% efficiency hit means you only have 90% of the throughput that you'd calculate from the simple N/4 formula.
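Concretely, the metric plotted is (sketching the formula implied by that description):

$$\text{efficiency} = \frac{N/4}{\text{measured cycles per iteration}}$$

For example, a 17-uop loop at 4.5 cycles/iteration has an efficiency of 4.25/4.5 ≈ 94%, i.e., about a 6% hit.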
The overall behavior in the 33-64 uop region is consistent with the hardware not doing any unrolling, which makes sense since a loop of more than 32 uops cannot be unrolled at all in a buffer of 64 uops.
The three regions discussed above are colored differently, and at least two competing effects are visible:
Everything else being equal, the larger the number of uops involved, the lower the efficiency hit. The hit is a fixed cost only once per iteration, so larger loops pay a smaller relative cost.
There is a large jump in inefficiency when you cross into the 33+ uop region: both the size of the throughput loss increases, and the number of affected uop counts doubles.
The first region is somewhat chaotic, and 7 uops is the worst overall uop count.
The DSB and LSD analysis above is for loop entries aligned to a 32-byte boundary, but the unaligned case doesn't seem to suffer in either case: there isn't a material difference from the aligned case (other than perhaps some small variation for less than 10 uops that I didn't investigate further).
Here's the unaligned results for 32N-2 and 32N+2 (i.e., the loop top 2 bytes before and after the 32B boundary):
The ideal N/4 line is also shown for reference.
Next we take a look at the prior microarchitecture: Haswell. The numbers here have been graciously provided by user Iwillnotexist Idonotexist.

First, the results from the "dense code" test, which exercises the LSD (for small uop counts) and the legacy pipeline (for larger uop counts, since the loop "busts out" of the DSB due to instruction density).

Immediately we see a difference in terms of when each architecture delivers uops from the LSD for a dense loop. Below we compare Skylake and Haswell for short loops of dense code (1 byte per instruction).

As described above, the Skylake loop stops being delivered from the LSD at exactly 19 uops, as expected from the 18-uops-per-32-byte-region-of-code limit. Haswell, on the other hand, seems to stop delivering reliably from the LSD for the 16-uop and 17-uop loops as well. I don't have any explanation for this. There is also a difference in the 3-uop case: oddly, both processors deliver only some of their uops out of the LSD in the 3- and 4-uop cases, but the exact proportion is the same for 4 uops, and different for 3.
We mostly care about the actual performance though, right? So let's look at the cycles/iteration for the 32-byte aligned dense code case:
This is the same data as shown above for Skylake (the misaligned series has been removed), with Haswell plotted alongside. Immediately you notice that the pattern is similar for Haswell, but not the same. As above, there are two regions here:
The loops larger than ~16-18 uops (the uncertainty is described above) are delivered from the legacy decoders. The pattern for Haswell is somewhat different from Skylake.
For the range from 19-30 uops they are identical, but after that Haswell breaks the pattern. Skylake took ceil(N/4) cycles for loops delivered from the legacy decoders. Haswell, on the other hand, seems to take something like ceil((N+1)/4) + ceil((N+2)/12) - ceil((N+1)/12). OK, that's messy (shorter form, anyone?) - but basically it means that while Skylake executes loops with 4N uops optimally (i.e., at 4 uops/cycle), such counts are usually (at least locally) the least optimal on Haswell - it takes one more cycle to execute such loops than on Skylake. So you are actually best off with loops of 4N-1 uops on Haswell, except that the 25% of such loops that are also of the form 16N-1 (31, 47, 63, etc.) take one additional cycle. It's starting to sound like a leap year calculation - but the pattern is probably best understood visually above.
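In display form, that empirical Haswell fit is:

$$\text{cycles}(N) \approx \left\lceil \frac{N+1}{4} \right\rceil + \left\lceil \frac{N+2}{12} \right\rceil - \left\lceil \frac{N+1}{12} \right\rceil$$

For example, for N = 28 (a multiple of 4) it predicts ceil(29/4) + ceil(30/12) - ceil(29/12) = 8 + 3 - 3 = 8 cycles, one more than Skylake's ceil(28/4) = 7.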
I don't think this pattern is intrinsic to uop dispatch on Haswell, so we shouldn't read too much into it. It seems to be explained by the way the legacy decoder works through the code in 16-byte chunks, as the following annotated listing of a 35-instruction loop shows:
0000000000455a80 <short_nop_aligned35.top>:
16B cycle
1 1 455a80: ff c8 dec eax
1 1 455a82: 90 nop
1 1 455a83: 90 nop
1 1 455a84: 90 nop
1 2 455a85: 90 nop
1 2 455a86: 90 nop
1 2 455a87: 90 nop
1 2 455a88: 90 nop
1 3 455a89: 90 nop
1 3 455a8a: 90 nop
1 3 455a8b: 90 nop
1 3 455a8c: 90 nop
1 4 455a8d: 90 nop
1 4 455a8e: 90 nop
1 4 455a8f: 90 nop
2 5 455a90: 90 nop
2 5 455a91: 90 nop
2 5 455a92: 90 nop
2 5 455a93: 90 nop
2 6 455a94: 90 nop
2 6 455a95: 90 nop
2 6 455a96: 90 nop
2 6 455a97: 90 nop
2 7 455a98: 90 nop
2 7 455a99: 90 nop
2 7 455a9a: 90 nop
2 7 455a9b: 90 nop
2 8 455a9c: 90 nop
2 8 455a9d: 90 nop
2 8 455a9e: 90 nop
2 8 455a9f: 90 nop
3 9 455aa0: 90 nop
3 9 455aa1: 90 nop
3 9 455aa2: 90 nop
3 9 455aa3: 75 db jne 455a80 <short_nop_aligned35.top>
Here I've noted the 16B decode chunk (1-3) each instruction appears in, and the cycle in which it will be decoded. The rule is basically that up to the next 4 instructions are decoded, as long as they fall in the current 16B chunk. Otherwise, they have to wait until the next cycle.
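Under that rule, the implied legacy-decode cost is (a sketch of the model the listing suggests, not a documented formula):

$$\text{cycles per iteration} \approx \sum_{\text{16B chunks } c} \left\lceil \frac{\text{instructions in } c}{4} \right\rceil$$

For the 35-instruction loop above, that gives ceil(15/4) + ceil(16/4) + ceil(4/4) = 4 + 4 + 1 = 9 cycles, matching the cycle column in the listing.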
This is a follow-on to the original answer, analyzing the behavior of five additional architectures - Nehalem, Sandy Bridge, Ivy Bridge, Broadwell and Coffee Lake - based on test results provided by Andreas Abel.

In addition to Skylake and Haswell, we'll take a quick look at the results for these architectures. It only needs to be a "quick" look, since all architectures other than Nehalem follow one of the existing patterns discussed above.

First, the short-nop case, which exercises the legacy decoder (for loops that don't fit in the LSD) and the LSD. Here are the cycles/iteration for this scenario, for all 7 architectures.

Figure 2.1: Dense nop performance on all architectures:
The graph is really busy (click for a larger view), and a bit hard to read since the results for many architectures lie on top of each other, but I have tried to ensure that a dedicated reader can track the line for any architecture.

First, let's discuss the big outlier: Nehalem. All of the other architectures have a slope roughly following the 4-uops/cycle line, but Nehalem runs at almost exactly 3 uops per cycle, so it quickly falls behind all of the other architectures. Outside of the initial LSD region, the line is also totally smooth, without the "staircase" appearance seen on the other architectures.

This is entirely consistent with Nehalem having a uop retirement limit of 3 uops/cycle. This is the bottleneck for uops outside of the LSD: they all execute at just about exactly 3 uops per cycle, bottlenecked on retirement. The front end isn't the bottleneck, so the exact uop count and decode arrangement don't matter, and hence the staircase is absent.
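So outside the LSD region, the expected Nehalem cost is simply (a sketch of the retirement-bound model):

$$\text{cycles per iteration} \approx \frac{N}{3}$$

with no rounding up to whole cycles - hence the smooth line.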
Other than Nehalem, the remaining architectures, except Broadwell, split fairly cleanly into groups: Haswell-like or Skylake-like. That is, all of Sandy Bridge, Ivy Bridge and Haswell behave like Haswell for loops greater than about 15 uops (the Haswell behavior is discussed in the other answer). Even though they are different microarchitectures, they behave largely the same, since their legacy decode capabilities are the same. Below about 15 uops, we see Haswell as somewhat faster for any uop count that is not a multiple of 4. Perhaps it gets an additional unrolling in the LSD due to a larger LSD, or there are other "small loop" optimizations. For Sandy Bridge and Ivy Bridge, this means that small loops should definitely target a uop count which is a multiple of 4.

Coffee Lake behaves similarly to Skylake1. This makes sense, since the microarchitecture is the same. Below about 16 uops, Coffee Lake looks better than Skylake, but this is just an effect of Coffee Lake's LSD being disabled by default. Skylake was tested with an enabled LSD, before Intel eventually disabled it via a microcode update due to a security issue. Coffee Lake was released after this issue was known, so the LSD was disabled out of the box. So, for this test, Coffee Lake is using either the DSB (for loops below about 18 uops, which can still fit in the DSB) or the legacy decoder (for the remainder of the loops), which leads to better results for small-uop-count loops where the LSD imposes an overhead (interestingly, for larger loops, the LSD and the legacy decoder happen to impose exactly the same overhead, for very different reasons).
Finally, we take a look at 2-byte NOPs, which aren't dense enough to prevent the use of the DSB (so this case is more reflective of typical code).

Figure 2.2: 2-byte nop performance:

Again, the results are along the same lines as the earlier chart. Nehalem is still bottlenecked at 3 uops per cycle. For the range up to about 60 uops, all architectures other than Coffee Lake are using the LSD, and we see that Sandy Bridge and Ivy Bridge perform a bit worse here, rounding up to the next cycle and hence only achieving the maximum throughput of 4 uops/cycle if the number of uops in the loop is a multiple of 4. Above 32 uops, the "unrolling" feature of Haswell and the newer uarchs doesn't have any effect, so everything is roughly tied.

Sandy Bridge actually has a few uop ranges (e.g., from 36 through 44 uops) where it performs better than the newer architectures. These ranges seem to occur because not all loops are detected by the LSD, and in these ranges the loops are served out of the DSB instead. Since the DSB is generally faster, so is Sandy Bridge in those cases.
As Andreas Abel noted in the comments, there is actually a section in the Intel Optimization Manual, Section 3.4.2.5, that deals with this topic specifically. There, Intel says:

The LSD holds micro-ops that construct small "infinite" loops. Micro-ops from the LSD are allocated in the out-of-order engine. The loop in the LSD ends with a taken branch to the beginning of the loop. The taken branch at the end of the loop is always the last micro-op allocated in the cycle. The instruction at the beginning of the loop is always allocated at the next cycle. If the code performance is bound by front-end bandwidth, unused allocation slots result in a bubble in allocation, and can cause performance degradation. Allocation bandwidth in Intel microarchitecture code name Sandy Bridge is four micro-ops per cycle. Performance is best when the number of micro-ops in the LSD results in the least number of unused allocation slots. You can use loop unrolling to control the number of micro-ops that are in the LSD.
They go on to show an example where unrolling a loop by a factor of two doesn't help performance, due to LSD "rounding", but unrolling by three does work. The example is quite confusing, since it actually mixes two effects: unrolling more also reduces the loop overhead, and hence the number of uops per iteration. A more interesting example would have been one where unrolling the loop fewer times led to an increase in performance due to the LSD rounding effect.
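To show the flavor of the effect with made-up numbers (this is not Intel's actual example): take a loop whose body is six single-uop instructions plus a macro-fused dec/jnz, i.e., 7 fused-domain uops in total, and assume the Sandy Bridge LSD rounds up to whole cycles:

; hypothetical loop: 6 single-uop body instructions + fused dec/jnz = 7 uops
; Sandy Bridge LSD cost = ceil(uops/4) cycles per trip:
;   unroll x1:  7 uops -> ceil(7/4)  = 2 cycles -> 2.00 cycles/iteration
;   unroll x2: 13 uops -> ceil(13/4) = 4 cycles -> 2.00 cycles/iteration (no gain)
;   unroll x3: 19 uops -> ceil(19/4) = 5 cycles -> 1.67 cycles/iteration (gain)
ALIGN 32
.top:
%rep 3                 ; unroll factor: try 1, 2 or 3
add ebx, 1             ; six single-uop ALU instructions per copy
add ecx, 1
add edx, 1
add esi, 1
add edi, 1
add r8d, 1
%endrep
dec rax                ; dec/jnz macro-fuse into a single uop
jnz .top               ; (scale the trip count by the unroll factor)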
This section seems to accurately describe the behavior on Sandy Bridge and Ivy Bridge. The results above show that both of these architectures behave as described, and that you lose 1, 2 or 3 uop execution slots for loops with 4N+3, 4N+2 or 4N+1 uops, respectively.

It hasn't been updated for Haswell and later, however. As described in the other answer, performance has improved over the simple model described above, and the behavior is more complex.

1 There is a weird outlier at 16 uops where Coffee Lake performs worse than all the other architectures, even Nehalem (a regression of about 50%) - but perhaps this is just measurement noise?