performance x86 assembly cpu-architecture perf
I'm playing with the code from this answer, slightly modified:
BITS 64
GLOBAL _start
SECTION .text
_start:
mov ecx, 1000000
.loop:
;T is a symbol defined with the CLI (-DT=...)
TIMES T imul eax, eax
lfence
TIMES T imul edx, edx
dec ecx
jnz .loop
mov eax, 60 ;sys_exit
xor edi, edi
syscall
Without lfence, the results I get agree with the static analysis in that answer.
When I introduce a single lfence I expect the CPU to execute the imul edx, edx sequence of the k-th iteration in parallel with the imul eax, eax sequence of the next (k+1-th) iteration.
Something like this (calling A the imul eax, eax sequence and D the imul edx, edx one):
|
| A
| D A
| D A
| D A
| ...
| D A
| D
|
V time
which takes more or less the same number of cycles, but for one unpaired parallel execution.
When I measure the number of cycles, for the original and the modified version, with taskset -c 2 ocperf.py stat -r 5 -e cycles:u '-x ' ./main-$T for T in the range below, I get (a sketch of a driver script for this sweep is shown after the table):
T     Cycles:u (lfence)   Cycles:u (no lfence)   Delta
10 42047564 30039060 12008504
15 58561018 45058832 13502186
20 75096403 60078056 15018347
25 91397069 75116661 16280408
30 108032041 90103844 17928197
35 124663013 105155678 19507335
40 140145764 120146110 19999654
45 156721111 135158434 21562677
50 172001996 150181473 21820523
55 191229173 165196260 26032913
60 221881438 180170249 41711189
65 250983063 195306576 55676487
70 281102683 210255704 70846979
75 312319626 225314892 87004734
80 339836648 240320162 99516486
85 372344426 255358484 116985942
90 401630332 270320076 131310256
95 431465386 285955731 145509655
100 460786274 305050719 155735555
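(Not part of the original setup, but for reproducibility: a minimal sketch of a driver script for this sweep, assuming the source above is saved as main.asm and that nasm, ld, taskset and perf are on the PATH; ocperf.py is just a wrapper around perf, and a plain cycles:u event works the same way.)
import subprocess

for T in range(10, 101, 5):
    exe = f"main-{T}"
    # nasm's -D defines the T symbol consumed by "TIMES T imul ..."
    subprocess.run(["nasm", "-felf64", f"-DT={T}", "main.asm", "-o", f"{exe}.o"], check=True)
    subprocess.run(["ld", f"{exe}.o", "-o", exe], check=True)
    # pin to one core; '-x,' makes perf print machine-readable CSV (on stderr)
    r = subprocess.run(["taskset", "-c", "2", "perf", "stat", "-r", "5",
                        "-e", "cycles:u", "-x", ",", f"./{exe}"],
                       capture_output=True, text=True, check=True)
    print(T, r.stderr.strip().splitlines()[0])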
How can the Cycles:u values for lfence be explained?
I would have expected them to be similar to those of Cycles:u no lfence, since a single lfence should only prevent the first iteration from executing the two blocks in parallel.
I don't think it's due to the lfence overhead, as I believe that should be constant for all T.
I'd like to fix what's wrong with my mental model when dealing with static analysis of code.
I think you're measuring accurately, and the explanation is microarchitectural, not any kind of measurement error.
I think your results for mid to low T support the conclusion that lfence stops the front-end from even issuing past the lfence until all earlier instructions retire, rather than having all the uops from both chains already issued and just waiting for lfence to flip a switch and let multiplies from each chain start to dispatch on alternating cycles.
(port1 would get edx,eax,empty,edx,eax,empty,... for Skylake's 3c latency/1c throughput multiplier right away, if lfence didn't block the front-end, and overhead wouldn't scale with T.)
You lose imul throughput when only uops from the first chain are in the scheduler, because the front-end hasn't chewed through the imul edx,edx and loop branch yet. And for the same number of cycles at the end of the window, when the pipeline is mostly drained and only uops from the second chain are left.
The overhead delta looks linear up to about T=60. I didn't run the numbers, but the slope up to there looks reasonable for T * 0.25 clocks to issue the first chain vs. the 3c-latency execution bottleneck, i.e. the delta growing maybe 1/12th as fast as the total no-lfence cycles.
So (accounting for the lfence overhead I measured below), for T < 60:
no_lfence cycles/iter ~= 3T # OoO exec finds all the parallelism
lfence cycles/iter ~= 3T + T/4 + 9.3 # lfence constant + front-end delay
delta ~= T/4 + 9.3
@Margaret reports that T/4 is a better fit than 2*T / 4, but I would have expected T/4 at both the start and end, for a total of 2T/4 slope of the delta.
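(A quick sanity check that isn't in the original answer: evaluating that per-iteration model against a few of the measured deltas from the table, which are totals over the 1 million loop iterations, using the ~9.3c lfence cost measured further down.)
# model for T < 60: delta per iteration ~= T/4 + 9.3 cycles
iters = 1_000_000
measured_delta = {10: 12_008_504, 20: 15_018_347, 30: 17_928_197,
                  40: 19_999_654, 55: 26_032_913}   # values from the table above
for T, delta in measured_delta.items():
    predicted = (T / 4 + 9.3) * iters
    print(f"T={T:3d}  model ~{predicted / 1e6:5.1f}M  measured {delta / 1e6:5.1f}M")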
After about T=60, delta grows much more quickly (but still linearly), with a slope about equal to the total no-lfence cycles, thus about 3c per T. I think at that point, the scheduler (Reservation Station) size is limiting the out-of-order window. You probably tested on Haswell or Sandybridge/IvyBridge, which have a 60-entry or 54-entry scheduler respectively; Skylake's is 97 entries.
The RS tracks un-executed uops. Each RS entry holds 1 unfused-domain uop that's waiting for its inputs to be ready, and for its execution port, before it can dispatch and leave the RS (footnote 1).
After an lfence, the front-end issues at 4 per clock while the back-end executes at 1 per 3 clocks, issuing 60 uops in ~15 cycles, during which time only 5 imul instructions from the edx chain have executed. (There's no load or store micro-fusion here, so every fused-domain uop from the front-end is still only 1 unfused-domain uop in the RS [footnote 2].)
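(Back-of-the-envelope version of those numbers, using only the rates just stated; this is plain arithmetic, not a simulation.)
issue_width = 4       # fused-domain uops issued per clock by the front-end
imul_latency = 3      # clocks between dependent imuls on port 1
rs_worth = 60         # roughly one scheduler's worth of uops
cycles_to_issue = rs_worth / issue_width         # ~15 cycles to issue them
imuls_executed = cycles_to_issue / imul_latency  # ~5 imuls of the edx chain done by then
print(cycles_to_issue, imuls_executed)           # 15.0 5.0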
For large T, the RS fills up quickly, at which point the front-end can only make progress at the speed of the back-end. (For small T, we hit the next iteration's lfence before that happens, and that's what stalls the front-end.) When T > RS_size, the back-end can't see any of the uops from the eax imul chain until enough back-end progress through the edx chain has made room in the RS. At that point, one imul from each chain can dispatch every 3 cycles, instead of just the first or second chain.
Remember from the first section that the time spent just after lfence executing only the first chain = the time just before lfence executing only the second chain. That applies here, too.
We get some of this effect even with no lfence, for T > RS_size, but there's opportunity for overlap on both sides of a long chain. The ROB is at least twice the size of the RS, so the out-of-order window when not stalled by lfence should be able to keep both chains in flight constantly even when T is somewhat larger than the scheduler capacity. (Remember that uops leave the RS as soon as they've executed. I'm not sure if that means they have to finish executing and forward their result, or merely start executing, but that's a minor difference here for short ALU instructions. Once they're done, only the ROB is holding onto them until they retire, in program order.)
The ROB and register file shouldn't be limiting the out-of-order window size (http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) in this hypothetical situation, or in your real one. They should both be plenty big.
Blocking the front-end is an implementation detail of lfence on Intel uarches. The manual only says that later instructions can't execute. That wording would allow the front-end to issue/rename them all into the scheduler (Reservation Station) and ROB while lfence is still waiting, as long as none are dispatched to an execution unit.
So a weaker lfence might have flat overhead up to T=RS_size, and then the same slope you see now for T>60. (And the constant part of the overhead might be lower.)
Note that guarantees about speculative execution of conditional/indirect branches after lfence apply to execution, not (as far as I know) to code-fetch. Merely triggering code-fetch is not (AFAIK) useful for a Spectre or Meltdown attack, but possibly a timing side-channel to detect how it decodes could tell you something about the fetched code...
I think AMD's LFENCE is at least as strong on actual AMD CPUs, when the relevant MSR is enabled. (Is LFENCE serializing on AMD processors?)
lfence overhead: Your results are interesting, but it doesn't surprise me by itself that there's significant constant overhead from lfence (for small T), as well as a component that scales with T.
Remember that lfence doesn't allow later instructions to start until earlier instructions have retired. This is probably at least a couple of cycles / pipeline stages later than when their results are ready to bypass-forward to other execution units (i.e. the normal latency).
So for small T, it's definitely significant that you add extra latency into the chain by requiring the result to not only be ready, but also written back to the register file.
It probably takes an extra cycle or so for lfence to allow the issue/rename stage to start operating again after detecting retirement of the last instruction before it. The issue/rename process takes multiple stages (cycles), and maybe lfence blocks at the start of this, instead of in the very last step before uops are added into the OoO part of the core.
Even back-to-back lfence itself has 4 cycle throughput on SnB-family, according to Agner Fog's testing. Agner Fog reports 2 fused-domain uops (no unfused), but on Skylake I measure it at 6 fused-domain (still no unfused) if I only have 1 lfence. But with more lfence back-to-back, it's fewer uops! Down to ~2 uops per lfence with many back-to-back, which is how Agner measures.
lfence/dec/jnz (a tight loop with no work) runs at 1 iteration per ~10 cycles on SKL, so that might give us an idea of the real extra latency that lfence adds to the dep chains even without the front-end and RS-full bottlenecks.
Measuring lfence overhead with only one dep chain, OoO exec being irrelevant:
.loop:
;mfence ; mfence here: ~62.3c (with no lfence)
lfence ; lfence here: ~39.3c
times 10 imul eax,eax ; with no lfence: 30.0c
; lfence ; lfence here: ~39.6c
dec ecx
jnz .loop
Without lfence, runs at the expected 30.0c per iter. With lfence, runs at ~39.3c per iter, so lfence effectively added ~9.3c of "extra latency" to the critical path dep chain. (And 6 extra fused-domain uops).
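(Spelling out that arithmetic, purely as a check of the numbers above:)
chain = 10 * 3          # ten dependent imuls at 3c latency each = 30.0c baseline
print(39.3 - chain)     # ~9.3c that lfence adds to the critical path
print(62.3 - chain)     # ~32.3c that mfence adds in the same position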
With lfence after the imul chain, right before the loop-branch, it's slightly slower. But not a whole cycle slower, so that would indicate that the front-end is issuing the loop-branch and an imul in a single issue-group after lfence allows execution to resume. That being the case, IDK why it's slower. It's not from branch misses.
Interleaving the chains in program order, like @BeeOnRope suggests in comments, doesn't require out-of-order execution to exploit the ILP, so it's pretty trivial:
.loop:
lfence ; at the top of the loop is the lowest-overhead place.
%rep T
imul eax,eax
imul edx,edx
%endrep
dec ecx
jnz .loop
You could put pairs of short times 8 imul chains inside a %rep to let OoO exec have an easy time.
My mental model is that the issue/rename/allocate stages in the front-end add new uops to both the RS and the ROB at the same time.
Uops leave the RS after executing, but stay in the ROB until in-order retirement. The ROB can be large because it's never scanned out-of-order to find the first-ready uop, only scanned in-order to check if the oldest uop(s) have finished executing and thus are ready to retire.
(I assume the ROB is physically a circular buffer with start/end indices, not a queue which actually copies uops to the right every cycle. But just think of it as a queue/list with a fixed max size, where the front-end adds uops at the front, and the retirement logic retires/commits uops from the end as long as they're fully executed, up to some per-cycle per-hyperthread retirement limit which is not usually a bottleneck. Skylake did increase it for better Hyperthreading, maybe to 8 per clock per logical thread. Perhaps retirement also means freeing physical registers which helps HT, because the ROB itself is statically partitioned when both threads are active. That's why retirement limits are per logical thread.)
Uops like nop, xor eax,eax, or lfence, which are handled in the front-end (don't need any execution units on any ports) are added only to the ROB, in an already-executed state. (A ROB entry presumably has a bit that marks it as ready to retire vs. still waiting for execution to complete. This is the state I'm talking about. For uops that did need an execution port, I assume the ROB bit is set via a completion port from the execution unit. And that the same completion-port signal frees its RS entry.)
Uops stay in the ROB from issue to retirement.
Uops stay in the RS from issue to execution. The RS can replay uops in a few cases, e.g. for the other half of a cache-line-split load, or if it was dispatched in anticipation of load data arriving, but in fact it didn't. (Cache miss or other conflicts like Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?.) Or when a load port speculates that it can bypass the AGU before starting a TLB lookup to shorten pointer-chasing latency with small offsets - Is there a penalty when base+offset is in a different page than the base?
So we know that the RS can't remove a uop right as it dispatches, because it might need to be replayed. (Can happen even to non-load uops that consume load data.) But any speculation that needs replays is short-range, not through a chain of uops, so once a result comes out the other end of an execution unit, the uop can be removed from the RS. Probably this is part of what a completion port does, along with putting the result on the bypass forwarding network.
TL:DR: P6-family: RS is fused, SnB-family: RS is unfused.
A micro-fused uop is issued to two separate RS entries in Sandybridge-family, but only 1 ROB entry. (Assuming it isn't un-laminated before issue, see section 2.3.5 for HSW or section 2.4.2.4 for SnB of Intel's optimization manual, and Micro fusion and addressing modes. Sandybridge-family's more compact uop format can't represent indexed addressing modes in the ROB in all cases.)
The load can dispatch independently, ahead of the other operand for the ALU uop being ready. (Or for micro-fused stores, either of the store-address or store-data uops can dispatch when its input is ready, without waiting for both.)
I used the two-dep-chain method from the question to experimentally test this on Skylake (RS size = 97), with micro-fused or edi, [rdi] vs. mov+or, and another dep chain in rsi. (Full test code, NASM syntax on Godbolt)
; loop body
%rep T
%if FUSE
or edi, [rdi] ; static buffers are in the low 32 bits of address space, in non-PIE
%else
mov eax, [rdi]
or edi, eax
%endif
%endrep
%rep T
%if FUSE
or esi, [rsi]
%else
mov eax, [rsi]
or esi, eax
%endif
%endrep
Looking at uops_executed.thread (unfused-domain) per cycle (or per second which perf calculates for us), we can see a throughput number that doesn't depend on separate vs. folded loads.
With small T (T=30), all the ILP can be exploited, and we get ~0.67 uops per clock with or without micro-fusion. (I'm ignoring the small bias of 1 extra uop per loop iteration from dec/jnz. It's negligible compared to the effect we'd see if micro-fused uops only used 1 RS entry)
Remember that load+or is 2 uops, and we have 2 dep chains in flight, so this is 4/6, because or edi, [rdi] has 6 cycle latency. (Not 5, which is surprising, see below.)
At T=60, we still have about 0.66 unfused uops executed per clock for FUSE=0, and 0.64 for FUSE=1. We can still find basically all the ILP, but it's just barely starting to dip, as the two dep chains are 120 uops long (vs. a RS size of 97).
At T=120, we have 0.45 unfused uops per clock for FUSE=0, and 0.44 for FUSE=1. We're definitely past the knee here, but still finding some of the ILP.
If a micro-fused uop took only 1 RS entry, FUSE=1 T=120 should be about the same speed as FUSE=0 T=60, but that's not the case. Instead, FUSE=0 or 1 makes nearly no difference at any T. (Including larger ones like T=200: FUSE=0: 0.395 uops/clock, FUSE=1: 0.391 uops/clock.) We'd have to go to very large T before the time with only 1 dep-chain in flight totally dominates the time with 2 in flight, and throughput gets down to 0.33 uops/clock (2/6).
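(The RS-occupancy reasoning behind that test, written out as a sketch; the 97-entry RS size is Skylake's, as above. The point is that if a micro-fused uop occupied only 1 RS entry, the FUSE=1 knee would move out to roughly twice the T of the FUSE=0 knee.)
rs_size = 97   # Skylake scheduler entries
for T in (30, 60, 120, 200):
    per_chain_unfused = 2 * T        # each or+load link is 1 load uop + 1 ALU uop in the RS
    per_chain_if_1_entry = T         # hypothetical: 1 RS entry per micro-fused uop
    print(f"T={T:3d}  unfused/chain={per_chain_unfused:3d} (>RS? {per_chain_unfused > rs_size})"
          f"  if-1-entry/chain={per_chain_if_1_entry:3d} (>RS? {per_chain_if_1_entry > rs_size})")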
Oddity: We have such a small but still measurable difference in throughput for fused vs. unfused, with separate mov loads being faster.
Other oddities: the total uops_executed.thread is slightly lower for FUSE=0 at any given T. Like 2,418,826,591 vs. 2,419,020,155 for T=60. This difference was repeatable down to +- 60k out of 2.4G, plenty precise enough. FUSE=1 is slower in total clock cycles, but most of the difference comes from lower uops per clock, not from more uops.
Simple addressing modes like [rdi] are supposed to only have 4 cycle latency, so load + ALU should be only 5 cycles. But I measure 6 cycle load-use latency for or rdi, [rdi], or with a separate MOV-load; with any other ALU instruction I can never get the load part to be 4c.
A complex addressing mode like [rdi + rbx + 2064] has the same latency when there's an ALU instruction in the dep chain, so it appears that Intel's 4c latency for simple addressing modes only applies when a load is forwarding to the base register of another load (with up to a +0..2047 displacement and no index).
Pointer-chasing is common enough that this is a useful optimization, but we need to think of it as a special load-load forwarding fast-path, not as a general data ready sooner for use by ALU instructions.
P6-family is different: an RS entry holds a fused-domain uop.
@Hadi found an Intel patent from 2002, where Figure 12 shows the RS in the fused domain.
Experimental testing on a Conroe (first gen Core2Duo, E6600) shows that there's a large difference between FUSE=0 and FUSE=1 for T=50. (The RS size is 32 entries).
T=50 FUSE=0: total time of 3.272G cycles (0.62 IPC = 0.31 load+OR per clock). (perf/ocperf.py doesn't have events for uops_executed on uarches before Nehalem or so, and I don't have oprofile installed on that machine.)
T=24 there's a negligible difference between FUSE=0 and FUSE=1, around 0.47 IPC vs 0.9 IPC (~0.45 load+OR per clock).
T=24 is still over 96 bytes of code in the loop, too big for Core 2's 64-byte (pre-decode) loop buffer, so it's not faster because of fitting in a loop buffer. Without a uop-cache, we have to be worried about the front-end, but I think we're fine because I'm exclusively using 2-byte single-uop instructions that should easily decode at 4 fused-domain uops per clock.
I will present an analysis for the case T = 1 for both codes (with and without lfence). You can then extend this for other values of T. You can refer to Figure 2.4 of the Intel Optimization Manual for a visual.
Because there is only a single easily predicted branch, the front-end will only stall if the back-end stalls. The front-end in Haswell is 4-wide, meaning up to 4 fused-domain uops can be issued per cycle from the IDQ (the instruction decode queue, which is just a queue that holds in-order fused-domain uops, also called the uop queue) to the reservation station (RS) of the scheduler. Each imul is decoded into a single uop that cannot be fused. The instructions dec ecx and jnz .loop get macrofused in the front-end into a single uop. One of the differences between microfusion and macrofusion is that when the scheduler dispatches a macrofused uop (that is not microfused) to the execution unit it's assigned to, it gets dispatched as a single uop. In contrast, a microfused uop needs to be split into its constituent uops, each of which must be dispatched separately to an execution unit. (However, splitting microfused uops happens on entrance to the RS, not at dispatch; see Footnote 2 in @Peter's answer.) lfence is decoded into 6 uops. Recognizing microfusion only matters in the back-end, and in this case there is no microfusion in the loop.
Since the loop branch is easily predictable and the number of iterations is relatively large, we can assume without compromising accuracy that the allocator will always be able to allocate 4 uops per cycle. In other words, the scheduler will receive 4 uops per cycle. Since there is no microfusion, each uop will be dispatched as a single uop.
imul can only be executed by the Slow Int execution unit (see Figure 2.4). This means the only choice for executing the imul uops is to dispatch them to port 1. In Haswell, the Slow Int unit is nicely pipelined, so one imul can be dispatched per cycle. But it takes three cycles for the result of the multiplication to be available to any instruction that needs it (the writeback stage is the third cycle after the dispatch stage of the pipeline). So for each dependency chain, at most one imul can be dispatched per 3 cycles.
Because dec/jnz is predicted taken, the only execution unit that can execute it is the Primary Branch unit on port 6.
So at any given cycle, as long as the RS has space, it will receive 4 uops. But what kind of uops? Let's examine the loop without lfence:
imul eax, eax
imul edx, edx
dec ecx/jnz .loop (macrofused)
There are two possibilities:

- Two imuls from the same iteration, one imul from a neighboring iteration, and a dec/jnz from one of those two iterations.
- A dec/jnz from one iteration, the two imuls from the next iteration, and the dec/jnz from that same next iteration.

So at the beginning of any cycle, the RS will receive at least one dec/jnz and at least one imul from each chain. At the same time, in the same cycle, and from those uops that are already in the RS, the scheduler will do one of two things:

- Dispatch the oldest dec/jnz to port 6 and dispatch the oldest imul that is ready to port 1. That's a total of 2 uops.
- Since each chain can have at most one imul ready every 3 cycles, in some cycles no imul in the RS is ready to execute. However, there is always at least one dec/jnz in the RS, so the scheduler can dispatch that. That's a total of 1 uop.

Now we can calculate the expected number of uops in the RS, X_N, at the end of any given cycle N:
X_N = X_{N-1} + (number of uops to be allocated into the RS at the beginning of cycle N) - (expected number of uops to be dispatched at the beginning of cycle N)
    = X_{N-1} + 4 - ((0+1)*1/3 + (1+1)*2/3)
    = X_{N-1} + 12/3 - 5/3
    = X_{N-1} + 7/3, for all N > 0

The initial condition of the recurrence is X_0 = 4. This is a simple recurrence that can be solved by unfolding X_{N-1}:

X_N = 4 + 2.3 * N, for all N >= 0
The RS in Haswell has 60 entries. We can determine the first cycle at which the RS is expected to become full:
60 = 4 + 7/3 * N
N = 56/2.3 = 24.3
So at the end of cycle 24.3, the RS is expected to be full. This means that at the beginning of cycle 25.3, the RS will not be able to receive any new uops. Now the number of iterations, I, under consideration determines how you should proceed with the analysis. Since a dependency chain requires at least 3*I cycles to execute, it takes about 8.1 iterations to reach cycle 24.3. So if the number of iterations is larger than 8.1, which is the case here, you need to analyze what happens after cycle 24.3.
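(Solving that recurrence numerically, as a check; the 24.3 above comes from using the rounded slope 2.3 instead of the exact 7/3.)
rs_size = 60            # Haswell RS entries
x, n = 4.0, 0           # X_0 = 4
while x < rs_size:
    x += 7 / 3          # X_N = X_{N-1} + 7/3
    n += 1
print(n, round(x, 1))   # RS expected to be full after ~24 cycles
print(round(n / 3, 1))  # ~8 iterations, since each iteration's chain takes 3 cycles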
The scheduler dispatches instructions at the following rates every cycle (as discussed above):
1
2
2
1
2
2
1
2
.
.
But the allocator will not allocate any uops into the RS unless there are at least 4 available entries; otherwise, it will not waste power issuing uops at sub-optimal throughput. However, it is only at the beginning of every 4th cycle that there are at least 4 free entries in the RS. So starting from cycle 24.3, the allocator is expected to be stalled 3 out of every 4 cycles.
Another important observation for the code being analyzed is that it never happens that there are more than 4 uops ready to be dispatched, which means that the average number of uops that leave their execution units per cycle is not larger than 4. At most 4 uops can be retired from the ReOrder Buffer (ROB) per cycle. This means that the ROB can never be on the critical path. In other words, performance is determined by the dispatch throughput.
We can calculate the IPC (instructions per cycle) fairly easily now. The ROB entries look something like this:
imul eax, eax - N
imul edx, edx - N + 1
dec ecx/jnz .loop - M
imul eax, eax - N + 3
imul edx, edx - N + 4
dec ecx/jnz .loop - M + 1
The column on the right shows the cycle in which the instruction can be retired. Retirement happens in order and is bounded by the latency of the critical path. Here each dependency chain has the same path length, so both constitute two equal critical paths of length 3 cycles. So every 3 cycles, 4 instructions can be retired. The IPC is therefore 4/3 = 1.3 and the CPI is 3/4 = 0.75. This is much smaller than the theoretical optimal IPC of 4 (even without considering micro- and macro-fusion). Because retirement happens in order, the retirement behavior will be the same.
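(As a quick prediction before looking at the counters — a trivial check, assuming the 1 million iterations used in the test:)
iters = 1_000_000
pred_cycles = 3 * iters           # 3 cycles per iteration (critical-path latency)
pred_instructions = 4 * iters     # imul, imul, dec, jnz per iteration
print(pred_cycles, pred_instructions, pred_instructions / pred_cycles)
# 3000000 4000000 1.33...  -- compare with the perf output below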
We can check our analysis using both perf and IACA. I'll discuss perf. I have a Haswell CPU.
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-nolfence
Performance counter stats for './main-1-nolfence' (10 runs):
30,01,556 cycles:u ( +- 0.00% )
40,00,005 instructions:u # 1.33 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
23,42,246 UOPS_ISSUED.ANY ( +- 0.26% )
22,49,892 RESOURCE_STALLS.RS ( +- 0.00% )
0.001061681 seconds time elapsed ( +- 0.48% )
There are 1 million iterations, each taking about 3 cycles. Each iteration contains 4 instructions and the IPC is 1.33. RESOURCE_STALLS.ROB shows the number of cycles in which the allocator was stalled due to a full ROB. This of course never happens. UOPS_ISSUED.ANY can be used to count the number of uops issued to the RS and the number of cycles in which the allocator was stalled (for no specific reason). The first one is straightforward (not shown in the perf output); 1 million * 3 = 3 million + small noise. The latter is much more interesting. It shows that about 73% of the time the allocator was stalled due to a full RS, which matches our analysis. RESOURCE_STALLS.RS counts the number of cycles in which the allocator was stalled due to a full RS. This is close to UOPS_ISSUED.ANY because the allocator does not stall for any other reason (though the difference could be proportional to the number of iterations for some reason; I'll have to see the results for T>1).
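(Reproducing those fractions from the counter values above — a trivial check, not part of the original output:)
cycles = 3_001_556
rs_full_stalls = 2_249_892          # RESOURCE_STALLS.RS
issue_stall_cycles = 2_342_246      # UOPS_ISSUED.ANY with cmask=1,inv=1 counts stalled cycles
print(rs_full_stalls / cycles)      # ~0.75, close to the predicted 3 out of every 4 cycles
print(issue_stall_cycles / cycles)  # ~0.78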
The analysis of the code without lfence can be extended to determine what happens if an lfence is added between the two imuls. Let's check out the perf results first (IACA unfortunately does not support lfence):
perf stat -r 10 -e cycles:u,instructions:u,cpu/event=0xA2,umask=0x10,name=RESOURCE_STALLS.ROB/u,cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u,cpu/event=0xA2,umask=0x4,name=RESOURCE_STALLS.RS/u ./main-1-lfence
Performance counter stats for './main-1-lfence' (10 runs):
1,32,55,451 cycles:u ( +- 0.01% )
50,00,007 instructions:u # 0.38 insns per cycle ( +- 0.00% )
0 RESOURCE_STALLS.ROB
1,03,84,640 UOPS_ISSUED.ANY ( +- 0.04% )
0 RESOURCE_STALLS.RS
0.004163500 seconds time elapsed ( +- 0.41% )
Observe that the number of cycles has increased by about 10 million, or 10 cycles per iteration. The number of cycles by itself doesn't tell us much. The number of retired instructions has increased by a million, which is expected. We already know that lfence will not make instructions complete any faster, so RESOURCE_STALLS.ROB should not change. UOPS_ISSUED.ANY and RESOURCE_STALLS.RS are particularly interesting. In this output, UOPS_ISSUED.ANY counts cycles, not uops. The number of uops can also be counted (using cpu/event=0x0E,umask=0x1,name=UOPS_ISSUED.ANY/u instead of cpu/event=0x0E,umask=0x1,cmask=1,inv=1,name=UOPS_ISSUED.ANY/u), and it has increased by 6 uops per iteration (no fusion). This means that an lfence placed between the two imuls was decoded into 6 uops. The one million dollar question is now what these uops do and how they move around in the pipe.
RESOURCE_STALLS.RS is zero. What does that mean? It indicates that the allocator, when it sees an lfence in the IDQ, stops allocating until all of the current uops in the ROB have retired. In other words, the allocator will not allocate entries into the RS past an lfence until the lfence retires. Since the loop body contains only 3 other uops, the 60-entry RS will never be full. In fact, it will always be almost empty.
The IDQ in reality is not a single simple queue. It consists of multiple hardware structures that can operate in parallel. The number of uops an lfence requires depends on the exact design of the IDQ. The allocator, which also consists of many different hardware structures, when it sees there is an lfence uop at the front of any of the structures of the IDQ, suspends allocation from that structure until the ROB is empty. So different uops are used with different hardware structures.
UOPS_ISSUED.ANY shows that the allocator is not issuing any uops for about 9-10 cycles per iteration. What is happening here? Well, one of the uses of lfence is that it can tell us how long it takes to retire an instruction and allocate the next instruction. The following assembly code can be used to do that:
TIMES T lfence
The performance event counters will not work well for small values of T. For sufficiently large T, and by measuring UOPS_ISSUED.ANY, we can determine that it takes about 4 cycles to retire each lfence. That's because UOPS_ISSUED.ANY will be incremented about 4 times every 5 cycles. So after every 4 cycles, the allocator issues another lfence (it doesn't stall), then it waits for another 4 cycles, and so on. That said, instructions that produce results may require 1 or a few more cycles to retire, depending on the instruction. IACA always assumes that it takes 5 cycles to retire an instruction.
Our loop looks like this:
imul eax, eax
lfence
imul edx, edx
dec ecx
jnz .loop
At any cycle at the lfence boundary, the ROB will contain the following instructions, starting from the top of the ROB (the oldest instruction):
imul edx, edx - N
dec ecx/jnz .loop - N
imul eax, eax - N+1
where N denotes the cycle at which the corresponding instruction was dispatched. The last instruction that is going to complete (reach the writeback stage) is imul eax, eax, and this happens at cycle N+4. The allocator-stall cycle count will be incremented during cycles N+1, N+2, N+3, and N+4. However, it will be about 5 more cycles until imul eax, eax retires. In addition, after it retires, the allocator needs to clean up the lfence uops from the IDQ and allocate the next group of instructions before they can be dispatched in the next cycle. The perf output tells us that it takes about 13 cycles per iteration and that the allocator stalls (because of the lfence) for 10 of these 13 cycles.
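(The same per-iteration numbers pulled out of the perf output above, as a check:)
iters = 1_000_000
cycles = 13_255_451                 # cycles:u from the lfence run
alloc_stall_cycles = 10_384_640     # UOPS_ISSUED.ANY (cmask=1,inv=1) = cycles with nothing issued
print(cycles / iters)               # ~13.3 cycles per iteration
print(alloc_stall_cycles / iters)   # ~10.4 of them with the allocator stalled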
The graph from the question shows only the number of cycles up to T=100. However, there is another (final) knee at that point, so it would be better to plot the cycles up to T=120 to see the full pattern.