选择性地使用AVX2指令对列表中的元素进行排序

War*_*ens 3 optimization x86 simd avx avx2

我想用AVX2指令加快以下操作,但我无法找到一种方法.

我得到了uint64_t data[100000]一大堆uint64_t和一个unsigned char indices[100000]字节数组.我想输出一个数组uint64_t Out[256],其中第i个值是所有data[j]这样的xor index[j]=i.

我想要的直接实现是这样的:

uint64_t Out[256] = {0};     // initialize output array
for (i = 0; i < 100000 ; i++) {
    Out[Indices[i]] ^= data[i];
}
Run Code Online (Sandbox Code Playgroud)

我们可以使用AVX2指令更有效地实现这一点吗?

编辑:这是我的代码现在的样子

uint64_t Out[256][4] = {0};   // initialize output array
for (i = 0; i < 100000 ; i+=4) {
    Out[Indices[i  ]][0] ^= data[i];
    Out[Indices[i+1]][1] ^= data[i+1];
    Out[Indices[i+2]][2] ^= data[i+2];
    Out[Indices[i+3]][3] ^= data[i+3];
}
Run Code Online (Sandbox Code Playgroud)

Pet*_*des 6

Based on static analysis for Haswell/Skylake, I came up with a version that runs in ~5 cycles per 4 i values, instead of 8 cycles, when compiled by gcc. Average for large sizes, not including the time to combine multiple copies of Out[], and assuming a random distribution of Indices that doesn't lead to any store/reload dep chains running for long enough to matter.

IDK if you care about Ryzen or Excavator (the other 2 mainstream AVX2 microachitectures).

我没有手工仔细分析,但IACA对于HSW/SKL是错误的,并且认为某些指令实际上没有微熔合(在具有性能计数器的i7-6700k上测试),所以它认为前端瓶颈比实际更严重.例如,movhps加载+合并微型保险丝,但IACA认为它甚至没有简单的寻址模式.

我们应该忽略任何缓存未命中,因为uint64_t Out[4][256]只有8kiB.因此,我们的缓存占用空间仅为最新CPU上L1d大小的1/4,即使超线程在两个逻辑线程之间共享L1d,也应该大致正常.循环data[],Indices[]应该预先好好,并希望不会逐出Out[].因此,静态分析很有可能具有一定的准确性,并且比仔细的微基准测试更快,更重要的是告诉您瓶颈的确切含义.

但当然,我们在很大程度上依赖于无序执行和不完美的调度或其他意外的瓶颈很容易发生.不过,如果我没有得到报酬,我不觉得实际上是微支持标记.

我们可以使用AVX2指令更有效地实现这一点吗?

This is basically a histogram problem. The usual histogram optimization of using multiple tables and combining at the end applies. SIMD XOR is useful for the combine-at-the-end (as long as you use Out[4][256], not Out[256][4]. The latter also makes the indexing slower by requiring scaling by 8*4 instead of 8 (which can be done with a single LEA in a scaled-index addressing mode)).

But unlike a normal histogram, you're XORing in some data from memory instead of ADDing a constant 1. So instead of an immediate 1, the code has to load data[i] into a register as a source for xor. (Or load, then xor reg, data[i]/store). This is even more total memory operations than a histogram.

我们提前从"手动"收集/分散到SIMD向量(使用movq/ movhpsloads/stores),允许我们使用SIMD进行data[i]加载和异或.这减少了负载操作的总数,从而降低了负载端口压力,而无需花费额外的前端带宽.

手动收集到256位向量可能不值得额外的改组(额外的vinserti128/vextracti128,因此我们可以将2个内存源组合vpxor成一个256位的).128位向量应该是好的.前端吞吐量也是一个主要问题,因为(在Intel SnB系列CPU上)您希望避免存储的索引寻址模式.gcc使用lea指令计算寄存器中的地址,而不是使用索引的加载/存储.clang/LLVM -march=skylake决定不这样做,在这种情况下这是一个错误的决定,因为端口2 /端口3上的循环瓶颈,并且花费额外的ALU微操作以允许存储地址uops使用端口7是一个胜利.但如果你在p23上没有瓶颈,花费额外的uops来避免索引商店并不好.(和在可以保持微熔合的情况下,绝对不仅仅是为了避免索引负载; 傻gcc).也许gcc和LLVM的寻址模式成本模型不是很准确,或者它们没有对流水线进行足够详细的建模,以确定前端与特定端口之间的循环瓶颈.

选择寻址模式和其他asm代码选择对于在SnB系列上实现最佳性能至关重要.但用C语写作让你无法控制 ; 除非你可以调整源代码以使它做出不同的选择,否则你大多受编译器的支配.例如gcc vs. clang在这里有很大的不同.

On SnB-family, a movhps load needs port 5 for the shuffle/blend (although it does micro-fuse into one uop), but a movhps store is a pure store with no ALU uop. So it's break-even there, and lets us use one SIMD load/XOR for two data elements.

With AVX, unaligned memory source operands are allowed for ALU uops, so we don't need to require alignment for data[]. But Intel HSW/SKL can keep an indexed addressing mode micro-fused with pxor but not vpxor. So compiling without AVX enabled can be better, allowing the compiler to use an indexed addressing mode instead of incrementing a separate pointer. (Or making it faster if the compiler doesn't know this and uses an indexed addressing mode anyway.) TL:DR: probably best to require 16-byte aligned data[] and compile that function with AVX disabled, for better macro-fusion. (But then we miss out on 256-bit SIMD for combining the Out slices at the end, unless we put that in a different function compiled with AVX or AVX2)

避免未对齐的负载也将避免任何缓存线分裂,这不会花费额外的uops,但我们可能接近于L1d吞吐量限制的瓶颈,而不仅仅是加载/存储执行单元吞吐量限制.


我还看了一次加载4个索引并用ALU指令解压缩.例如memcpy进入struct { uint8_t idx[4]; } idx;.但gcc会生成多个浪费的指令,用于解压缩.太糟糕了,x86没有很好的位域指令,比如ARM ubfx或者特别是 PowerPC rlwinm(它可以让结果左移免费,所以如果x86有这个,静态Out就可以在非PIC代码中使用base + disp32寻址模式. )

Unpacking a dword with shift/movzx from AL/AH is a win if we're using scalar XOR, but it looks like it's not when we're using SIMD for data[] and spending front-end throughput on lea instructions to allow store-address uops to run on port 7. That makes us front-end bottlenecked instead of port2/3 bottlenecked, so using 4x movzx loads from memory looks best according to static analysis. Would be worth benchmarking both ways if you take the time to hand-edit the asm. (The gcc-generated asm with extra uops is just bad, including a completely redundant movzx after right-shifting by 24, leaving the upper bits already zero.)


The code

(See it on the Godbolt compiler explorer, along with a scalar version):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <stdalign.h>

#ifdef IACA_MARKS
#include "/opt/iaca-3.0/iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif

void hist_gatherscatter(unsigned idx0, unsigned idx1,
                       uint64_t Out0[256], uint64_t Out1[256],
                       __m128i vdata) {
    // gather load from Out[0][?] and Out[1][?] with movq / movhps
    __m128i hist = _mm_loadl_epi64((__m128i*)&Out0[idx0]);
    hist = _mm_castps_si128(   // movhps into the high half
               _mm_loadh_pi(_mm_castsi128_ps(hist), (__m64*)&Out1[idx1]));

    // xorps could bottleneck on port5.
    // Actually probably not, using __m128 the whole time would be simpler and maybe not confuse clang
    hist = _mm_xor_si128(hist, vdata);

    // scatter store with movq / movhps
    _mm_storel_epi64((__m128i*)&Out0[idx0], hist);
    _mm_storeh_pi((__m64*)&Out1[idx1], _mm_castsi128_ps(hist));
}

void ext(uint64_t*);

void xor_histo_avx(uint8_t *Indices, const uint64_t *data, size_t len)
{
    alignas(32) uint64_t Out[4][256] = {{0}};

    // optional: peel the first iteration and optimize away loading the old known-zero values from Out[0..3][Indices[0..3]].

    if (len<3)   // not shown: cleanup for last up-to-3 elements.
        return;

    for (size_t i = 0 ; i<len ; i+=4) {
        IACA_START
        // attempt to hand-hold compiler into a dword load + shifts to extract indices
        // to reduce load-port pressure
        struct { uint8_t idx[4]; } idx;
#if 0
        memcpy(&idx, Indices+i, sizeof(idx));  // safe with strict-aliasing and possibly-unaligned
   //gcc makes stupid asm for this, same as for memcpy into a struct,
   // using a dword load into EAX (good),
   // then AL/AH for the first 2 (good)
   // but then redundant mov and movzx instructions for the high 2

   // clang turns it into 4 loads

/*
     //Attempt to hand-hold gcc into less-stupid asm
     //doesn't work: same asm as the struct
        uint32_t tmp;
        memcpy(&tmp, Indices+i, sizeof(tmp));  // mov eax,[mem]
        idx.idx[0] = tmp;     //movzx reg, AL
        idx.idx[1] = tmp>>8;  //movzx reg, AH
        tmp >>= 16;           //shr   eax, 16
        idx.idx[2] = tmp;     //movzx reg, AL
        idx.idx[3] = tmp>>8;  //movzx reg, AH
*/
#else
       // compiles to separate loads with gcc and clang
        idx.idx[0] = Indices[i+0];
        idx.idx[1] = Indices[i+1];
        idx.idx[2] = Indices[i+2];
        idx.idx[3] = Indices[i+3];
#endif

        __m128i vd = _mm_load_si128((const __m128i*)&data[i]);
        hist_gatherscatter(idx.idx[0], idx.idx[1], Out[0], Out[1], vd);

        vd = _mm_load_si128((const __m128i*)&data[i+2]);
        hist_gatherscatter(idx.idx[2], idx.idx[3], Out[2], Out[3], vd);
    }
    IACA_END


   // hand-hold compilers into a pointer-increment loop
   // to avoid indexed addressing modes.  (4/5 speedup on HSW/SKL if all the stores use port7)
    __m256i *outp = (__m256i*)&Out[0];
    __m256i *endp = (__m256i*)&Out[3][256];
    for (; outp < endp ; outp++) {
        outp[0] ^= outp[256/4*1];
        outp[0] ^= outp[256/4*2];
        outp[0] ^= outp[256/4*3];
    }
    // This part compiles horribly with -mno-avx, but does compile
    // because I used GNU C native vector operators on __m256i instead of intrinsics.

/*
    for (int i=0 ; i<256 ; i+=4) {
        // use loadu / storeu if Out isn't aligned
        __m256i out0 = _mm256_load_si256(&Out[0][i]);
        __m256i out1 = _mm256_load_si256(&Out[1][i]);
        __m256i out2 = _mm256_load_si256(&Out[2][i]);
        __m256i out3 = _mm256_load_si256(&Out[3][i]);
        out0 = _mm256_xor_si256(out0, out1);
        out0 = _mm256_xor_si256(out0, out2);
        out0 = _mm256_xor_si256(out0, out3);
        _mm256_store_si256(&Out[0][i], out0);
    }
*/

    //ext(Out[0]);  // prevent optimizing away the work
    asm("" :: "r"(Out) : "memory");
}
Run Code Online (Sandbox Code Playgroud)

用gcc7.3编译-std=gnu11 -DIACA_MARKS -O3 -march=skylake -mno-avx,并用IACA-3.0进行分析:

$ /opt/iaca-3.0/iaca xor-histo.iaca.o                                                                             Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File -  xor-histo.iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 5.79 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  22 (this is fused-domain uops.  It's actually 20, so a 5 cycle front-end bottleneck)
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  2.0     0.0  |  3.0  |  5.5     5.1  |  5.5     4.9  |  4.0  |  3.0  |  2.0  |  3.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx r8d, byte ptr [rdi]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx edx, byte ptr [rdi+0x2]
|   1      |             |      |             |             |      |      | 1.0  |      | add rdi, 0x4
|   1      |             |      |             |             |      |      | 1.0  |      | add rsi, 0x20
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx eax, byte ptr [rdi-0x1]
|   1      |             | 1.0  |             |             |      |      |      |      | lea r12, ptr [rcx+r8*8]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx r8d, byte ptr [rdi-0x3]
|   1      |             | 1.0  |             |             |      |      |      |      | lea rdx, ptr [r10+rdx*8]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movq xmm0, qword ptr [r12]
|   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [r9+rax*8]
|   1      |             | 1.0  |             |             |      |      |      |      | lea r8, ptr [r11+r8*8]
|   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | movhps xmm0, qword ptr [r8]   # Wrong, 1 micro-fused uop on SKL
|   2^     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | pxor xmm0, xmmword ptr [rsi-0x20]
|   2^     |             |      | 0.5         | 0.5         | 1.0  |      |      |      | movq qword ptr [r12], xmm0   # can run on port 7, IDK why IACA chooses not to model it there
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movhps qword ptr [r8], xmm0
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movq xmm0, qword ptr [rdx]
|   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | movhps xmm0, qword ptr [rax]  # Wrong, 1 micro-fused uop on SKL
|   2^     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | pxor xmm0, xmmword ptr [rsi-0x10]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movq qword ptr [rdx], xmm0
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movhps qword ptr [rax], xmm0
|   1*     |             |      |             |             |      |      |      |      | cmp rbx, rdi
|   0*F    |             |      |             |             |      |      |      |      | jnz 0xffffffffffffffa0
Total Num Of Uops: 29  (This is unfused-domain, and a weird thing to total up).
Run Code Online (Sandbox Code Playgroud)

Godbolt上的gcc8.1使用缩放索引寻址模式pxor,使用相同的Indices计数器data[],以便保存add.

clang不使用LEA,并且i每7个周期4 个瓶颈,因为没有任何商店uops可以在端口7上运行.

标量版本(仍然使用4片Out[4][256]):

$ iaca.sh -mark 2 xor-histo.iaca.o                               
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - xor-histo.iaca.o
Binary Format - 64Bit
Architecture  - SKL
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 2
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 7.24 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 3.0    0.0  | 3.0  | 6.2    4.5  | 6.8    4.5  | 4.0  | 3.0  | 3.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov eax, dword ptr [rdi]
|   1    | 0.4       | 0.5 |           |           |     | 0.1 |     |     |    | add rdi, 0x4
|   1    |           | 0.7 |           |           |     | 0.3 |     |     |    | add rsi, 0x20
|   1*   |           |     |           |           |     |     |     |     |    | movzx r9d, al
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rdx, qword ptr [rbp+r9*8-0x2040]
|   2^   |           | 0.3 | 0.5   0.5 | 0.5   0.5 |     | 0.3 | 0.4 |     |    | xor rdx, qword ptr [rsi-0x20]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+r9*8-0x2040], rdx  # wrong, HSW/SKL can keep indexed stores fused
|   1*   |           |     |           |           |     |     |     |     |    | movzx edx, ah
|   1    |           |     |           |           |     | 0.4 | 0.6 |     |    | add rdx, 0x100
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov r9, qword ptr [rbp+rdx*8-0x2040]
|   2^   | 0.6       | 0.2 | 0.5   0.5 | 0.5   0.5 |     | 0.2 | 0.1 |     |    | xor r9, qword ptr [rsi-0x18]
|   2    |           |     | 0.2       | 0.8       | 1.0 |     |     |     |    | mov qword ptr [rbp+rdx*8-0x2040], r9  # wrong, HSW/SKL can keep indexed stores fused
|   1*   |           |     |           |           |     |     |     |     |    | mov edx, eax   # gcc code-gen isn't great, but not as bad as in the SIMD loop.  No extra movzx, but not taking advantage of AL/AH
|   1    | 0.4       |     |           |           |     |     | 0.6 |     |    | shr eax, 0x18
|   1    | 0.8       |     |           |           |     |     | 0.2 |     |    | shr edx, 0x10
|   1    |           | 0.6 |           |           |     | 0.3 |     |     |    | add rax, 0x300
|   1*   |           |     |           |           |     |     |     |     |    | movzx edx, dl
|   1    | 0.2       | 0.1 |           |           |     | 0.5 | 0.2 |     |    | add rdx, 0x200
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov r9, qword ptr [rbp+rdx*8-0x2040]
|   2^   |           | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.3 | 0.1 |     |    | xor r9, qword ptr [rsi-0x10]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+rdx*8-0x2040], r9  # wrong, HSW/SKL can keep indexed stores fused
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rdx, qword ptr [rbp+rax*8-0x2040]
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 0.6 | 0.4 |     |    | xor rdx, qword ptr [rsi-0x8]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+rax*8-0x2040], rdx  # wrong, HSW/SKL can keep indexed stores fused
|   1    | 0.6       |     |           |           |     |     | 0.4 |     |    | cmp r8, rdi
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffff75
Total Num Of Uops: 33
Run Code Online (Sandbox Code Playgroud)

该循环是4个融合域uop,比IACA计数的短,因为它不知道只有SnB/IvB非层压索引存储.HSW/SKL没有.但是,这样的商店仍然不能使用端口7,因此对于4个元素,这不会比~6.5个周期更好.

(And BTW, with naive handling of Indices[i], loading each one separately with movzx, you get 8 cycles for 4 elements, saturating ports 2 and 3. Even though gcc doesn't generate throughput-optimal code for unpacking the struct, the 4-byte load + unpack should be a net win by relieving some load-port pressure.)


Cleanup loop:

AVX2 really shines here: we loop over the lowest slice of the histogram, and XOR in the other slices. This loop is 8 front-end uops with 4 loads on Skylake, and should run at 1 iter per 2 clocks:

.L7:
    vmovdqa ymm2, YMMWORD PTR [rax+4096]
    vpxor   ymm0, ymm2, YMMWORD PTR [rax+6144]
    vmovdqa ymm3, YMMWORD PTR [rax]
    vpxor   ymm1, ymm3, YMMWORD PTR [rax+2048]
    vpxor   ymm0, ymm0, ymm1
    vmovdqa YMMWORD PTR [rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .L7
Run Code Online (Sandbox Code Playgroud)

I tried to reduce the uop count further by doing the XORs in one chain, but gcc insists on doing two vmovdqa loads and having to do one vpxor without a memory operand. (OoO exec will hide the latency of this tiny chain/tree of VPXOR so it doesn't matter.)


How would I use the scattering with AVX-512? Is there some scatters instruction that xors instead of overwrites?

No, you'd use a gather to get the old values, then SIMD XOR, then scatter the updated elements back to the locations they came from.

To avoid conflicts, you might want out[8][256] so every vector element can use a different table. (Otherwise you have a problem if Indices[i+0] and Indices[i+4] were equal, because the scatter store would just store the highest vector element with that index.

Scatter/gather instructions need a single base register, but you can simply add _mm256_setr_epi64(0, 256, 256*2, ...); after doing a vpmovzxbq zero-extending load.


Notes

I used IACA2.3 for the scalar analysis because IACA3.0 seems to have removed the -mark option to choose which loop to analyze when you have multiple marks in one file. IACA3.0 didn't fix any of the ways that IACA2.3 is wrong about SKL's pipeline in this case.

  • 如果有一个 4-uop 操作,通过“rdx”或任何你想要的东西将 4 个 64 位值从“ymm”移动到“rax”,那就太好了。 (2认同)