Z b*_*son 44 cpu x86 assembly intel iaca
我使用英特尔®架构代码分析器(IACA)发现了一些意想不到的东西(对我而言).
以下指令使用[base+index]寻址
addps xmm1, xmmword ptr [rsi+rax*1]
Run Code Online (Sandbox Code Playgroud)
根据IACA没有微熔丝.但是,如果我用[base+offset]这样的
addps xmm1, xmmword ptr [rsi]
Run Code Online (Sandbox Code Playgroud)
IACA报告它确实融合了.
英特尔优化参考手册的第2-11节给出了以下"可以由所有解码器处理的微融合微操作"的示例
FADD DOUBLE PTR [RDI + RSI*8]
Run Code Online (Sandbox Code Playgroud)
和Agner Fog的优化装配手册也给出了使用[base+index]寻址的微操作融合的例子.例如,请参见第12.2节"Core2上的相同示例".那么正确的答案是什么?
Pet*_*des 36
In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an instruction with an immediate operand can't micro-fuse a RIP-relative addressing mode).
But some combinations of uop and addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop count after un-lamination is what matters.
Intel's optimization manual describes un-lamination for Sandybridge in Section 2.3.2.4: Micro-op Queue and the Loop Stream Detector (LSD), but doesn't describe the changes for any later microarchitectures.
UPDATE: Now Intel manual has a detailed section to describe un-lamination for Haswell. See section 2.3.5 Unlamination. And a brief description for SandyBridge is in section 2.4.2.4.
The rules, as best I can tell from experiments on SnB, HSW, and SKL:
adc and cmov don't micro-fuse. Most VEX-encoded instructions also don't fuse since they generally have three operands (so paddb xmm0, [rdi+rbx] fuses but vpaddb xmm0, xmm0, [rdi+rbx] doesn't). Finally, the occasional 2-operand instruction where the first operand is write only, such as pabsb xmm0, [rax + rbx] also do not fuse. IACA is wrong, applying the SnB rules.Related: simple (non-indexed) addressing modes are the only ones that the dedicated store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a single register, but src with dst+(initial_src-initial_dst). Then you only have to increment the dst register inside a loop.)
Note that some instructions never micro-fuse at all (even in the decoders/uop-cache). e.g. shufps xmm, [mem], imm8, or vinsertf128 ymm, ymm, [mem], imm8, are always 2 uops on SnB through Skylake, even though their register-source versions are only 1 uop. This is typical for instructions with an imm8 control operand plus the usual dest/src1, src2 register/memory operands, but there are a few other cases. e.g. PSRLW/D/Q xmm,[mem] (vector shift count from a memory operand) doesn't micro-fuse, and neither does PMULLD.
See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL when you read lots of registers: Lots of micro-fusion with indexed addressing modes can lead to slowdowns vs. the same instructions with fewer register operands: one-register addressing modes and immediates. We don't know the cause yet, but I suspect some kind of register-read limit, maybe related to reading lots of cold registers from the PRF.
Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if they're later un-laminated.
# store
mov [rax], edi SnB/HSW/SKL: 1 fused-domain, 2 unfused. The store-address uop can run on port7.
mov [rax+rsi], edi SnB: unlaminated. HSW/SKL: stays micro-fused. (The store-address can't use port7, though).
mov [buf +rax*4], edi SnB: unlaminated. HSW/SKL: stays micro-fused.
# normal ALU stuff
add edx, [rsp+rsi] SnB: unlaminated. HSW/SKL: stays micro-fused.
# I assume the majority of traditional/normal ALU insns are like add
Run Code Online (Sandbox Code Playgroud)
Three-input instructions that HSW/SKL may have to un-laminate
vfmadd213ps xmm0,xmm0,[rel buf] HSW/SKL: stays micro-fused: 1 fused, 2 unfused.
vfmadd213ps xmm0,xmm0,[rdi] HSW/SKL: stays micro-fused
vfmadd213ps xmm0,xmm0,[0+rdi*4] HSW/SKL: un-laminated: 2 uops in fused & unfused-domains.
(So indexed addressing mode is still the condition for HSW/SKL, same as documented by Intel for SnB)
# no idea why this one-source BMI2 instruction is unlaminated
# It's different from ADD in that its destination is write-only (and it uses a VEX encoding)
blsi edi, [rdi] HSW/SKL: 1 fused-domain, 2 unfused.
blsi edi, [rdi+rsi] HSW/SKL: 2 fused & unfused-domain.
adc eax, [rdi] same as cmov r, [rdi]
cmove ebx, [rdi] Stays micro-fused. (SnB?)/HSW: 2 fused-domain, 3 unfused domain.
SKL: 1 fused-domain, 2 unfused.
# I haven't confirmed that this micro-fuses in the decoders, but I'm assuming it does since a one-register addressing mode does.
adc eax, [rdi+rsi] same as cmov r, [rdi+rsi]
cmove ebx, [rdi+rax] SnB: untested, probably 3 fused&unfused-domain.
HSW: un-laminated to 3 fused&unfused-domain.
SKL: un-laminated to 2 fused&unfused-domain.
Run Code Online (Sandbox Code Playgroud)
I assume that Broadwell behaves like Skylake for adc/cmov.
奇怪的是,HSW取消层叠存储器源ADC和CMOV.也许英特尔没有在他们到达Haswell发货的最后期限之前改变它从SnB.
Agner的insn表在HSW/SKL上表示cmovcc r,m并且adc r,m根本没有微融合,但这与我的实验不符.我测量的周期计数与融合域uop问题计数匹配,导致4 uops/clock问题瓶颈.希望他能仔细检查并更正表格.
Memory-dest整数ALU:
add [rdi], eax SnB: untested (Agner says 2 fused-domain, 4 unfused-domain (load + ALU + store-address + store-data)
HSW/SKL: 2 fused-domain, 4 unfused.
add [rdi+rsi], eax SnB: untested, probably 4 fused & unfused-domain
HSW/SKL: 3 fused-domain, 4 unfused. (I don't know which uop stays fused).
HSW: About 0.95 cycles extra store-forwarding latency vs. [rdi] for the same address used repeatedly. (6.98c per iter, up from 6.04c for [rdi])
SKL: 0.02c extra latency (5.45c per iter, up from 5.43c for [rdi]), again in a tiny loop with dec ecx/jnz
adc [rdi], eax SnB: untested
HSW: 4 fused-domain, 6 unfused-domain. (same-address throughput 7.23c with dec, 7.19c with sub ecx,1)
SKL: 4 fused-domain, 6 unfused-domain. (same-address throughput ~5.25c with dec, 5.28c with sub)
adc [rdi+rsi], eax SnB: untested
HSW: 5 fused-domain, 6 unfused-domain. (same-address throughput = 7.03c)
SKL: 5 fused-domain, 6 unfused-domain. (same-address throughput = ~5.4c with sub ecx,1 for the loop branch, or 5.23c with dec ecx for the loop branch.)
Run Code Online (Sandbox Code Playgroud)
Yes, that's right, adc [rdi],eax/dec ecx/jnz runs faster than the same loop with add instead of adc on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated rewrites of the same address (store-forwarding latency higher than expected. See also this post about repeated store/reload to the same address being slower than expected on SKL.
Memory-destination adc is so many uops because Intel P6-family (and apparently SnB-family) can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an extra uop to work around the problem-case where the load and add complete, and then the store faults, but the insn can't just be restarted because CF has already been updated. Interesting series of comments from Andy Glew (@krazyglew).
Presumably fusion in the decoders and un-lamination later saves us from needing microcode ROM to produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg.
Why SnB-family un-laminates:
Sandybridge simplified the internal uop format to save power and transistors (along with making the major change to using a physical register file, instead of keeping input/output data in the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add and adc are taking full advantage of that, or if Intel had to get Haswell out the door with some instructions
Nehalem and earlier have a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track micro-fused uops with 3 input registers (the non-memory register operand, base, and index).
So indexed stores and ALU+load instructions can still decode efficiently (not having to be the first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages of micro-fusion are essentially gone for tuning tight loops. "un-lamination" happens before the 4-fused-domain-uops-per-cycle issue/retire width out-of-order core. The fused-domain performance counters (uops_issued/uops_retired.retire_slots) count fused-domain uops after un-lamination.
Intel's description of the renamer (Section 2.3.3.1: Renamer) implies that it's the issue/rename stage which actually does the un-lamination, so uops destined for un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue/loop-buffer (aka the IDQ).
TODO: test this. Make a loop that should just barely fit in the loop buffer. Change something so one of the uops will be un-laminated before issuing, and see if it still runs from the loop buffer (LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf counters to track where uops come from, so this should be easy.
Harder TODO: if un-lamination happens between reading from the uop cache and adding to the IDQ, test whether it can ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage, can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing the first 4.)
(See the a previous version of this answer for some guesses based on tuning some LUT code, with some notes on vpgatherdd being about 1.7x more cycles than a pinsrw loop.)
The HSW/SKL numbers were measured on an i5-4210U and an i7-6700k. Both had HT enabled (but the system idle so the thread had the whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL and Linux 4.8 on HSW, using ocperf.py. (The HSW laptop NFS-mounted my SKL desktop's /home.)
The SnB numbers were measured as described below, on an i5-2500k which is no longer working.
Confirmed by testing with performance counters for uops and cycles.
I found a table of PMU events for Intel Sandybridge, for use with Linux's perf command. (Standard perf unfortunately doesn't have symbolic names for most hardware-specific PMU events, like uops.) I made use of it for a recent answer.
ocperf.py provides symbolic names for these uarch-specific PMU events, so you don't have to look up tables. Also, the same symbolic name works across multiple uarches. I wasn't aware of it when I first wrote this answer.
To test for uop micro-fusion, I constructed a test program that is bottlenecked on the 4-uops-per-cycle fused-domain limit of Intel CPUs. To avoid any execution-port contention, many of these uops are nops, which still sit in the uop cache and go through the pipeline the same as any other uop, except they don't get dispatched to an execution port. (An xor x, same, or an eliminated move, would be the same.)
Test program: yasm -f elf64 uop-test.s && ld uop-test.o -o uop-test
GLOBAL _start
_start:
xor eax, eax
xor ebx, ebx
xor edx, edx
xor edi, edi
lea rsi, [rel mydata] ; load pointer
mov ecx, 10000000
cmp dword [rsp], 2 ; argc >= 2
jge .loop_2reg
ALIGN 32
.loop_1reg:
or eax, [rsi + 0]
or ebx, [rsi + 4]
dec ecx
nop
nop
nop
nop
jg .loop_1reg
; xchg r8, r9 ; no effect on flags; decided to use NOPs instead
jmp .out
ALIGN 32
.loop_2reg:
or eax, [rsi + 0 + rdi]
or ebx, [rsi + 4 + rdi]
dec ecx
nop
nop
nop
nop
jg .loop_2reg
.out:
xor edi, edi
mov eax, 231 ; exit(0)
syscall
SECTION .rodata
mydata:
db 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
Run Code Online (Sandbox Code Playgroud)
I also found that the uop bandwidth out of the loop buffer isn't a constant 4 per cycle, if the loop isn't a multiple of 4 uops. (i.e. it's abc, abc, ...; not abca, bcab, ...). Agner Fog's microarch doc unfortunately wasn't clear on this limitation of the loop buffer. See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for more investigation on HSW/SKL. SnB may be worse than HSW in this case, but I'm not sure and don't still have working SnB hardware.
I wanted to keep macro-fusion (compare-and-branch) out of the picture, so I used nops between the dec and the branch. I used 4 nops, so with micro-fusion, the loop would be 8 uops, and fill the pipeline with at 2 cycles per 1 iteration.
In the other version of the loop, using 2-operand addressing modes that don't micro-fuse, the loop will be 10 fused-domain uops, and run in 3 cycles.
Results from my 3.3GHz Intel Sandybridge (i5 2500k). I didn't do anything to get the cpufreq governor to ramp up clock speed before testing, because cycles are cycles when you aren't interacting with memory. I've added annotations for the performance counter events that I had to enter in hex.
testing the 1-reg addressing mode: no cmdline arg
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test
Performance counter stats for './uop-test':
11.489620 task-clock (msec) # 0.961 CPUs utilized
20,288,530 cycles # 1.766 GHz
80,082,993 instructions # 3.95 insns per cycle
# 0.00 stalled cycles per insn
60,190,182 r1b1 ; UOPS_DISPATCHED: (unfused-domain. 1->umask 02 -> uops sent to execution ports from this thread)
80,203,853 r10e ; UOPS_ISSUED: fused-domain
80,118,315 r2c2 ; UOPS_RETIRED: retirement slots used (fused-domain)
100,136,097 r1c2 ; UOPS_RETIRED: ALL (unfused-domain)
220,440 stalled-cycles-frontend # 1.09% frontend cycles idle
193,887 stalled-cycles-backend # 0.96% backend cycles idle
0.011949917 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)
testing the 2-reg addressing mode: with a cmdline arg
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend ./uop-test x
Performance counter stats for './uop-test x':
18.756134 task-clock (msec) # 0.981 CPUs utilized
30,377,306 cycles # 1.620 GHz
80,105,553 instructions # 2.64 insns per cycle
# 0.01 stalled cycles per insn
60,218,693 r1b1 ; UOPS_DISPATCHED: (unfused-domain. 1->umask 02 -> uops sent to execution ports from this thread)
100,224,654 r10e ; UOPS_ISSUED: fused-domain
100,148,591 r2c2 ; UOPS_RETIRED: retirement slots used (fused-domain)
100,172,151 r1c2 ; UOPS_RETIRED: ALL (unfused-domain)
307,712 stalled-cycles-frontend # 1.01% frontend cycles idle
1,100,168 stalled-cycles-backend # 3.62% backend cycles idle
0.019114911 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)
So, both versions ran 80M instructions, and dispatched 60M uops to execution ports. (or with a memory source dispatches to an ALU for the or, and a load port for the load, regardless of whether it was micro-fused or not in the rest of the pipeline. nop doesn't dispatch to an execution port at all.) Similarly, both versions retire 100M unfused-domain uops, because the 40M nops count here.
The difference is in the counters for the fused-domain.
I suspect that you'd only see a difference between UOPS_ISSUED and UOPS_RETIRED(retirement slots used) if branch mispredicts led to uops being cancelled after issue, but before retirement.
And finally, the performance impact is real. The non-fused version took 1.5x as many clock cycles. This exaggerates the performance difference compared to most real cases. The loop has to run in a whole number of cycles, and the 2 extra uops push it from 2 to 3. Often, an extra 2 fused-domain uops will make less difference. And potentially no difference, if the code is bottlecked by something other than 4-fused-domain-uops-per-cycle.
Still, code that makes a lot of memory references in a loop might be faster if implemented with a moderate amount of unrolling and incrementing multiple pointers which are used with simple [base + immediate offset] addressing, instead of the using [base + index] addressing modes.
RIP-relative with an immediate can't micro-fuse. Agner Fog's testing shows that this is the case even in the decoders/uop-cache, so they never fuse in the first place (rather than being un-laminated).
IACA gets this wrong, and claims that both of these micro-fuse:
cmp dword [abs mydata], 0x1b ; fused counters != unfused counters (micro-fusion happened, and wasn't un-laminated). Uses 2 entries in the uop-cache, according to Agner Fog's testing
cmp dword [rel mydata], 0x1b ; fused counters ~= unfused counters (micro-fusion didn't happen)
Run Code Online (Sandbox Code Playgroud)
RIP-rel does micro-fuse (and stay fused) when there's no immediate, e.g.:
or eax, dword [rel mydata] ; fused counters != unfused counters, i.e. micro-fusion happens
Run Code Online (Sandbox Code Playgroud)
Micro-fusion doesn't increase the latency of an instruction. The load can issue before the other input is ready.
ALIGN 32
.dep_fuse:
or eax, [rsi + 0]
or eax, [rsi + 0]
or eax, [rsi + 0]
or eax, [rsi + 0]
or eax, [rsi + 0]
dec ecx
jg .dep_fuse
Run Code Online (Sandbox Code Playgroud)
This loop runs at 5 cycles per iteration, because of the eax dep chain. No faster than a sequence of or eax, [rsi + 0 + rdi], or mov ebx, [rsi + 0 + rdi] / or eax, ebx. (The unfused and the mov versions both run the same number of uops.) Scheduling/dep checking happens in the unfused-domain. Newly issued uops go into the scheduler (aka Reservation Station (RS)) as well as the ROB. They leave the scheduler after dispatching (aka being sent to an execution unit), but stay in the ROB until retirement. So the out-of-order window for hiding load latency is at least the scheduler size (54 unfused-domain uops in Sandybridge, 60 in Haswell, 97 in Skylake).
Micro-fusion doesn't have a shortcut for the base and offset being the same register. A loop with or eax, [mydata + rdi+4*rdi] (where rdi is zeroed) runs as many uops and cycles as the loop with or eax, [rsi+rdi]. This addressing mode could be used for iterating over an array of odd-sized structs starting at a fixed address. This is probably never used in most programs, so it's no surprise that Intel didn't spend transistors on allowing this special-case of 2-register modes to micro-fuse. (And Intel documents it as "indexed addressing modes" anyway, where a register and scale factor are needed.)
Macro-fusion of a cmp/jcc or dec/jcc creates a uop that stays as a single uop even in the unfused-domain. dec / nop / jge can still run in a single cycle but is three uops instead of one.
注意:自从我写下这个答案以来,Peter也对Haswell和Skylake进行了测试,并将结果整合到上面接受的答案中(特别是,我在下面提到的Skylake的大部分改进似乎实际上都出现在Haswell中).您应该看到这个问题的答案对于在CPU行为的破败和这个答案(虽然不是错误的)主要是历史的兴趣.
我的测试表明,在SKYLAKE微架构至少1,处理器完全融合,即使复杂的寻址方式,不同的Sandy Bridge.
也就是说,Peter上面发布的代码的1-arg和2-arg版本以相同的周期数运行,并且调度和退出的uop数相同.
我的结果:
绩效计数器统计数据./uop-test:
23.718772 task-clock (msec) # 0.973 CPUs utilized
20,642,233 cycles # 0.870 GHz
80,111,957 instructions # 3.88 insns per cycle
60,253,831 uops_executed_thread # 2540.344 M/sec
80,295,685 uops_issued_any # 3385.322 M/sec
80,176,940 uops_retired_retire_slots # 3380.316 M/sec
0.024376698 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)
绩效计数器统计数据./uop-test x:
13.532440 task-clock (msec) # 0.967 CPUs utilized
21,592,044 cycles # 1.596 GHz
80,073,676 instructions # 3.71 insns per cycle
60,144,749 uops_executed_thread # 4444.487 M/sec
80,162,360 uops_issued_any # 5923.718 M/sec
80,104,978 uops_retired_retire_slots # 5919.478 M/sec
0.013997088 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)
绩效计数器统计数据./uop-test x x:
16.672198 task-clock (msec) # 0.981 CPUs utilized
27,056,453 cycles # 1.623 GHz
80,083,140 instructions # 2.96 insns per cycle
60,164,049 uops_executed_thread # 3608.645 M/sec
100,187,390 uops_issued_any # 6009.249 M/sec
100,118,409 uops_retired_retire_slots # 6005.112 M/sec
0.016997874 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)
我没有在Skylake上找到任何UOPS_RETIRED_ANY指令,只有"退役老虎机"显然是融合域.
最终的测试(uop-test x x)是彼得建议使用cmp与立即相关的RIP相关的变体,其已知不会被微观化:
.loop_riprel
cmp dword [rel mydata], 1
cmp dword [rel mydata], 2
dec ecx
nop
nop
nop
nop
jg .loop_riprel
Run Code Online (Sandbox Code Playgroud)
结果表明,每个循环额外的2个uop由uops发出和退出的计数器拾取(因此测试可以区分融合发生,而不是).
欢迎对其他架构进行更多测试!您可以在github中找到代码(从上面的Peter复制).
[1] ......也许是Skylake和Sandybridge之间的其他一些架构,因为Peter只测试了SB而我只测试了SKL.
我现在已经审查了英特尔Sandy Bridge,Ivy Bridge,Haswell和Broadwell的测试结果.我还没有访问过Skylake的测试.结果是:
您的结果可能是由于其他因素造成的.我没有尝试过使用IACA.
| 归档时间: |
|
| 查看次数: |
4504 次 |
| 最近记录: |