Posts by Ale*_*ltz

Why are AVX ymm (m256) instructions about 4 times slower than xmm (m128)?

I wrote a program that multiplies arr1 by arr2 element-wise and stores the result in arr3.

Pseudocode:
arr3[i]=arr1[i]*arr2[i]
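
For reference, here is a minimal scalar sketch of the pseudocode in plain C. It is not the assembly routine itself: the name `mul_arrays_ref` and its parameter names are hypothetical. The split into full blocks of 16 floats plus a scalar remainder is only an assumption that mirrors the `sar r9, 4` comment in the assembly. A routine like this can serve as a correctness baseline when comparing the xmm and ymm versions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine: dst[i] = a[i] * b[i] for i in [0, n). */
void mul_arrays_ref(const float *a, const float *b, float *dst, size_t n)
{
    assert(n == 0 || (a && b && dst));

    size_t blocks = n >> 4;   /* number of full 16-float blocks (n / 16) */
    size_t i = 0;

    /* Main part: one pass per block of 16 elements, which is the work an
       unrolled SIMD loop would cover in a single iteration. */
    for (size_t blk = 0; blk < blocks; ++blk) {
        for (size_t k = 0; k < 16; ++k, ++i) {
            dst[i] = a[i] * b[i];
        }
    }

    /* Residual part: the remaining n % 16 elements, one at a time. */
    for (; i < n; ++i) {
        dst[i] = a[i] * b[i];
    }
}
```

Running both assembly versions and this routine on the same input, then comparing the outputs, helps confirm that any timing difference is not caused by one version doing different work.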


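I want to use AVX instructions. I have assembly code for both the m128 and the m256 versions (unrolled). The measurements show that the ymm version is about 4 times slower than the xmm version. But why, if the instruction latencies are the same?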

Mul_ASM_AVX proc ; (float* RCX=arr1, float* RDX=arr2, float* R8=arr3, int R9 = arraySize)

    push rbx

    vpxor xmm0, xmm0, xmm0 ; Zero the counters
    vpxor xmm1, xmm1, xmm1
    vpxor xmm2, xmm2, xmm2
    vpxor xmm3, xmm3, xmm3

    mov rbx, r9
    sar r9, 4       ; Divide the count by 16 for AVX
    jz MulResiduals ; If that's 0, then we have only scalar …


x86 assembly sse avx amd-processor

Ale*_*ltz

2020-02-12

4
votes

1
answer

303
views