Gav*_*ood 3 assembly intel avx avx2 iaca
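I'm trying to establish a performance baseline for a memory-bound vectorized loop. I'm doing this on an Intel Broadwell chip with AVX2 instructions, using 32-byte-aligned data.
The baseline loop uses 8 YMM registers at a time, loading from one location and doing non-temporal stores to another: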
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignment
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
mov ebx, 111 ; IACA PREFIX
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovapd ymm0, ymmword ptr [ (32) + rdi +8*rax]
vmovapd ymm2, ymmword ptr [ (64) + rdi +8*rax]
vmovapd ymm4, ymmword ptr [ (96) + rdi +8*rax]
vmovapd ymm6, ymmword ptr [ (128) + rdi +8*rax]
vmovapd ymm8, ymmword ptr [ (160) + rdi +8*rax]
vmovapd ymm10, ymmword ptr [ (192) + rdi +8*rax]
vmovapd ymm12, ymmword ptr [ (224) + rdi +8*rax]
vmovapd ymm14, ymmword ptr [ (256) + rdi +8*rax]
vmovntpd ymmword ptr [ (32) + rsi +8*rax], ymm0
vmovntpd ymmword ptr [ (64) + rsi +8*rax], ymm2
vmovntpd ymmword ptr [ (96) + rsi +8*rax], ymm4
vmovntpd ymmword ptr [ (128) + rsi +8*rax], ymm6
vmovntpd ymmword ptr [ (160) + rsi +8*rax], ymm8
vmovntpd ymmword ptr [ (192) + rsi +8*rax], ymm10
vmovntpd ymmword ptr [ (224) + rsi +8*rax], ymm12
vmovntpd ymmword ptr [ (256) + rsi +8*rax], ymm14
add rax, (4*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
pop rbp ; restore the frame pointer pushed at entry
ret
I assemble it with YASM and then test it with the Intel Architecture Code Analyzer (IACA), which tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: PORT2_AGU, PORT3_AGU, Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.5 0.0 | 0.5 | 8.0 4.0 | 8.0 4.0 | 8.0 | 0.5 | 0.5 | 0.0 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm0, ymmword ptr [rdi+rax*8+0x20]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm2, ymmword ptr [rdi+rax*8+0x40]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm4, ymmword ptr [rdi+rax*8+0x60]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm6, ymmword ptr [rdi+rax*8+0x80]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm8, ymmword ptr [rdi+rax*8+0xa0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm10, ymmword ptr [rdi+rax*8+0xc0]
| 1 | | | 1.0 1.0 | | | | | | CP | vmovapd ymm12, ymmword ptr [rdi+rax*8+0xe0]
| 1 | | | | 1.0 1.0 | | | | | CP | vmovapd ymm14, ymmword ptr [rdi+rax*8+0x100]
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x20], ymm0
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x40], ymm2
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x60], ymm4
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x80], ymm6
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xa0], ymm8
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xc0], ymm10
| 2 | | | 1.0 | | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0xe0], ymm12
| 2 | | | | 1.0 | 1.0 | | | | CP | vmovntpd ymmword ptr [rsi+rax*8+0x100], ymm14
| 1 | | 0.5 | | | | 0.5 | | | | add rax, 0x20
| 1 | 0.5 | | | | | | 0.5 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff78
I was under the impression that Broadwell could issue two loads at the same time, one each on ports 2 and 3. Why isn't that happening here?
Thanks
UPDATE
Following the suggestions below, I replaced pd with ps and folded each address into a single register. The new code looks like this:
%define ptr
%define ymmword yword
%define SIZE 16777216*8 ;; array size >> LLC
align 32 ;; avx2 vector alignment
global _ls_01_opt
section .text
_ls_01_opt: ;rdi is input, rsi output
push rbp
mov rbp,rsp
xor rax,rax
xor rbx,rbx
xor rcx,rcx
or rbx, rdi
or rcx, rsi
mov ebx, 111 ; IACA PREFIX (overwrites the pointer in rbx; analysis-only build)
db 0x64, 0x67, 0x90 ;
LOOP0:
vmovaps ymm0, ymmword ptr [ (32) + rbx ]
vmovaps ymm2, ymmword ptr [ (64) + rbx ]
vmovaps ymm4, ymmword ptr [ (96) + rbx ]
vmovaps ymm6, ymmword ptr [ (128) + rbx ]
vmovaps ymm8, ymmword ptr [ (160) + rbx ]
vmovaps ymm10, ymmword ptr [ (192) + rbx ]
vmovaps ymm12, ymmword ptr [ (224) + rbx ]
vmovaps ymm14, ymmword ptr [ (256) + rbx ]
vmovntps ymmword ptr [ (32) + rcx], ymm0
vmovntps ymmword ptr [ (64) + rcx], ymm2
vmovntps ymmword ptr [ (96) + rcx], ymm4
vmovntps ymmword ptr [ (128) + rcx], ymm6
vmovntps ymmword ptr [ (160) + rcx], ymm8
vmovntps ymmword ptr [ (192) + rcx], ymm10
vmovntps ymmword ptr [ (224) + rcx], ymm12
vmovntps ymmword ptr [ (256) + rcx], ymm14
add rax, (4*8)
add rbx, (4*8*8)
add rcx, (4*8*8)
cmp rax, SIZE
jne LOOP0
mov ebx, 222 ; IACA SUFFIX
db 0x64, 0x67, 0x90 ;
pop rbp ; restore the frame pointer pushed at entry
ret
IACA then tells me:
Throughput Analysis Report
--------------------------
Block Throughput: 8.00 Cycles Throughput Bottleneck: Port4
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 5.3 4.0 | 5.3 4.0 | 8.0 | 1.0 | 1.0 | 5.3 |
---------------------------------------------------------------------------------------
N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm0, ymmword ptr [rbx+0x20]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm2, ymmword ptr [rbx+0x40]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm4, ymmword ptr [rbx+0x60]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm6, ymmword ptr [rbx+0x80]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm8, ymmword ptr [rbx+0xa0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm10, ymmword ptr [rbx+0xc0]
| 1 | | | 1.0 1.0 | | | | | | | vmovaps ymm12, ymmword ptr [rbx+0xe0]
| 1 | | | | 1.0 1.0 | | | | | | vmovaps ymm14, ymmword ptr [rbx+0x100]
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x20], ymm0
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x40], ymm2
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x60], ymm4
| 2^ | | | | | 1.0 | | | 1.0 | CP | vmovntps ymmword ptr [rcx+0x80], ymm6
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xa0], ymm8
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xc0], ymm10
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0xe0], ymm12
| 2^ | | | 0.3 | 0.3 | 1.0 | | | 0.3 | CP | vmovntps ymmword ptr [rcx+0x100], ymm14
| 1 | 1.0 | | | | | | | | | add rax, 0x20
| 1 | | 1.0 | | | | | | | | add rbx, 0x100
| 1 | | | | | | 1.0 | | | | add rcx, 0x100
| 1 | | | | | | | 1.0 | | | cmp rax, 0x8000000
| 0F | | | | | | | | | | jnz 0xffffffffffffff7a
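This tells me that the stores can now use port 7 for address generation and that the store operations are micro-fused. IACA tells me the "Block Throughput" is still 8 cycles, which I assume is due to the extra operations needed to get each address into a single register. Perhaps I'm doing this wrong?
I still don't understand why the load operations can't be fused.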
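The store AGU on port 7 can only handle "simple" effective addresses (no index register), so your stores need the AGUs on the load ports too. IACA does show that your loads aren't actually competing with each other; it's the stores that are competing.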
Note that MOVNT stores only have about 10 fill buffers per core to work with, so those will quickly fill up and become the bottleneck.
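See also "Micro fusion and addressing modes". If you use a one-register addressing mode, your stores may be able to micro-fuse and take fewer fused-domain uops.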
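As a rough illustration of the one-register idea, here is a minimal YASM sketch. It is not the asker's code. It assumes the System V AMD64 calling convention, 32-byte-aligned buffers, and a hypothetical third argument in rdx giving the byte count (a nonzero multiple of 256). The name copy_nt_sketch is made up for this example. Comparing the source pointer against an end pointer removes the separate counter register, and the offsets start at 0 rather than 32.

```nasm
; Sketch only. Assumptions: rdi = src, rsi = dst (both 32-byte aligned),
; rdx = byte count, a nonzero multiple of 256. Names are hypothetical.
global copy_nt_sketch
section .text
align 32
copy_nt_sketch:
    add     rdx, rdi                ; rdx = one past the end of the source
.loop:
    vmovaps ymm0, [rdi]             ; eight 32-byte loads, base+disp only
    vmovaps ymm1, [rdi + 32]
    vmovaps ymm2, [rdi + 64]
    vmovaps ymm3, [rdi + 96]
    vmovaps ymm4, [rdi + 128]
    vmovaps ymm5, [rdi + 160]
    vmovaps ymm6, [rdi + 192]
    vmovaps ymm7, [rdi + 224]
    vmovntps [rsi], ymm0            ; eight non-temporal stores, no index register
    vmovntps [rsi + 32], ymm1
    vmovntps [rsi + 64], ymm2
    vmovntps [rsi + 96], ymm3
    vmovntps [rsi + 128], ymm4
    vmovntps [rsi + 160], ymm5
    vmovntps [rsi + 192], ymm6
    vmovntps [rsi + 224], ymm7
    add     rdi, 256                ; advance both pointers by 8 * 32 bytes
    add     rsi, 256
    cmp     rdi, rdx                ; stop when the source pointer reaches the end
    jne     .loop
    sfence                          ; order the NT stores before any later stores
    vzeroupper                      ; avoid SSE/AVX transition penalties in the caller
    ret
```

Every load and store here uses a base+displacement address, so the store-address uops are eligible for port 7. It does not change the Port4 figure in the report above: eight stores per iteration still need eight store-data slots.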
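Also, I guess it doesn't matter for VEX-encoded instructions, but the SSE pd versions take an extra byte of x86 machine code. clang prefers movaps for loads/stores because it's shorter, even on integer vectors. Every existing CPU runs movaps and movapd identically, so I'd recommend just using vmovaps / vmovntps. With VEX encoding it makes no difference at all, though: it's just one fewer set bit in the VEX prefix.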
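To make the size difference concrete, the listing below shows the encodings for a plain [rdi] operand, worked out by hand from the instruction formats. The choice of xmm0/ymm0 and rdi is arbitrary, and the file is only meant to be assembled and inspected (for example through a listing file), not executed.

```nasm
bits 64
section .text
    movaps  xmm0, [rdi]     ; 0F 28 07        legacy SSE, 3 bytes
    movapd  xmm0, [rdi]     ; 66 0F 28 07     66 prefix adds a byte
    vmovaps ymm0, [rdi]     ; C5 FC 28 07     2-byte VEX prefix, pp = 00
    vmovapd ymm0, [rdi]     ; C5 FD 28 07     same length, pp = 01
```

The two VEX forms differ only in the pp field of the second prefix byte (FC versus FD), which is the single extra set bit mentioned above.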