Why does does a NOP (as a 5th uop) speed up a 4 uop loop on Ice Lake?

Noa*_*oah 7 assembly x86-64 cpu-architecture micro-optimization

All benchmarks are done on: Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)

Edit: I was not able to reproduce this on broadwell and @PeterCordes was unable to reproduce it on skylake

I was trying to benchmark different methods of doing integer min(a, b) but ran into some unexplained behavior that I've boiled down to the following benchmark:

#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))

#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL  0
#define BREAK_DEPENDENCY      0
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start, end;
    const uint64_t N = 1000000;
    start            = _rdtsc();
    uint64_t v0, v1, dst, loop_cnt;
    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "movl $1, %k[v1]\n\t"
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt], %k[v0]\n\t"
        "xorl %k[loop_cnt], %k[v1]\n\t"
        "movl %k[v0], %k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#endif
        // macro-fusion is NOT broken
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc", "memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
        "BREAK_DEPENDENCY - [%s]\n\t",
        SIX_BYTES_COMPUTATION ? "ON" : "OFF",
        WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n", dif);
}

Run Code Online (Sandbox Code Playgroud)

Turning on WITH_NOP_BEFORE_DECL so that there is a nop before the decl + jnz causes a measureable performance improvement when SIX_BYTES_COMPUTATION is turned on though causes measurable performance degradation when SIX_BYTES_COMPUTATION is turned off.

Here are the numbers:

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    2.080 "Cycles" <--- Just 6 nops

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    2.363 "Cycles" <--- Performance degradation from previous

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    2.185 "Cycles" <--- Computation then decl + jnz

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    1.945 "Cycles" <--- Performance improvement from previous

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    1.919 "Cycles" <--- Breaking dependencies has best performance

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
    2.046 "Cycles" <--- nop hurts performance when breaking dependencies

Run Code Online (Sandbox Code Playgroud)

It might have to do with the register file filling up? I was able to find one potentially interesting metric uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread] which has the following outputs:

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    473,647      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    495,380      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    1,406,244      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    875,364      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    647,297      uops_issued.stall_cycles

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
    501,015      uops_issued.stall_cycles

Run Code Online (Sandbox Code Playgroud)

It does seem to correspond with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL on or off but I'm not sure 1) why a nop would save room in the register files.

I'm pretty sure its not an alignment issue because with the .p2align 4 between the first 6 bytes of the loop body and the decl + jnz the decl + jnz will be on a different 16 byte aligned region either way and it the performance difference is reversed depending on what is in the loop body (so if it where an alignment thing it shouldn't matter if loop body is nops or computation).

I was thinking it might to do with some dependency chain issue but because if I break the dependency on v0 and v1 at the end of the loop then WITH_NOP_BEFORE_DECL turned on causes a performance degradation. I am very likely wrong about that though because I have no idea why a nop before the end of the loop would affect any dependency issues.

It almost definitely does not have to do with port scheduling. I was thinking maybe something weird was going where the nop by chance lead to better scheduling but didn't get any different in uops on ports 1,2,5,6 with out or WITH_NOP_BEFORE_DECL on:

Instruction per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL Off:

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF]
         1,147,196      uops_dispatched.port_0                                      
         1,114,665      uops_dispatched.port_1                                      
         1,138,238      uops_dispatched.port_5                                      
         1,266,212      uops_dispatched.port_6

Run Code Online (Sandbox Code Playgroud)

Instruction per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL On:

SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON]
         1,177,092      uops_dispatched.port_0                                      
         1,081,734      uops_dispatched.port_1                                      
         1,103,314      uops_dispatched.port_5                                      
         1,296,546      uops_dispatched.port_6

Run Code Online (Sandbox Code Playgroud)

My leading theory is that there is some inefficiency in the register renaming process that is the performance bound without the nop and by luck the nop hides that issue but I am not at all confident in that.

Can anyone help me understand this behavior.

Edit: Full cpp code and new times w/ warmup and lfence before rdtsc.

New Code

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>

#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))

#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
void BENCH_FUNC_ATTR
bench() {
    uint64_t       start, end;
    const uint64_t N        = (1UL << 24);
    const uint64_t WARMUP_N = N << 3;
    uint64_t       v0, v1, dst, loop_cnt;


    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "movl $1, %k[v1]\n\t"
        "movl %[N], %k[loop_cnt]\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
        "xorl %k[loop_cnt], %k[v0]\n\t"
        "xorl %k[loop_cnt], %k[v1]\n\t"
        "movl %k[v0], %k[dst]\n\t"
        ".p2align 4\n\t"
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc", "memory");


    asm volatile("lfence\n\t" : : : "memory");
    start = _rdtsc();
    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "movl $1, %k[v1]\n\t"
        "movl %[N], %k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"
#if SIX_BYTES_COMPUTATION
        "xorl %k[loop_cnt], %k[v0]\n\t"
        "xorl %k[loop_cnt], %k[v1]\n\t"
        "movl %k[v0], %k[dst]\n\t"
#else
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#endif
        ".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
        "nop\n\t"
#endif
#if BREAK_DEPENDENCY
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#endif
        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc", "memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf(
        "SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
        "BREAK_DEPENDENCY - [%s]\n\t",
        SIX_BYTES_COMPUTATION ? "ON" : "OFF",
        WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");

    printf("%.3lf \"Cycles\"\n", dif);
}


int
main(int argc, char ** argv) {
    bench();
}

Run Code Online (Sandbox Code Playgroud)

New Times

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    0.645 "Cycles"

Run Code Online (Sandbox Code Playgroud)

New times have the same trend as before just they are all alot faster.

Edit: Icelake perf numbers

SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.681 "Cycles"
    18,522,385,353      lsd.uops                                                    
         1,038,665      idq.dsb_uops                                                
     4,270,402,172      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.778 "Cycles"
    20,669,567,680      lsd.uops                                                    
         1,049,193      idq.dsb_uops                                                
     4,807,261,565      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.734 "Cycles"
    12,080,048,840      lsd.uops                                                    
         1,035,128      idq.dsb_uops                                                
     4,552,666,461      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
    0.659 "Cycles"
    14,232,154,418      lsd.uops                                                    
         1,150,777      idq.dsb_uops                                                
     4,134,080,501      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
    0.735 "Cycles"
    12,080,166,963      lsd.uops                                                    
           982,311      idq.dsb_uops                                                
     4,553,457,015      cpu-cycles                                                  


SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
    0.644 "Cycles"
    16,374,872,770      lsd.uops                                                    
         1,022,379      idq.dsb_uops                                                
     4,055,306,960      cpu-cycles

Run Code Online (Sandbox Code Playgroud)

Edit: I am certain its not dependency chain or byte related. Its adding one nop (non-backend uop) in certain places really helped performance. Hers a benchmark that demonstrates that really clearly I think.

#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>


#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif

#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif

#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
void BENCH_FUNC_ATTR
bench() {
    uint64_t          start, end;
    const uint64_t    N        = (1UL << 31);
    const uint64_t    WARMUP_N = N >> 3;
    register uint64_t v0 asm("rdi");
    register uint64_t v1 asm("rsi");
    register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
    register uint64_t v3 asm("rax");
#endif
    register uint64_t loop_cnt asm("rcx");


    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
        "xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_OUP
        "xorl %k[v3], %k[v3]\n\t"
#endif
        "movl %[N], %k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(WARMUP_N)
        : "cc", "memory");


    start = _rdtsc();
    asm volatile(
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
        "xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_OUP
        "xorl %k[v3], %k[v3]\n\t"
#endif
        "movl %[N], %k[loop_cnt]\n\t"
        "lfence\n\t"
        ".p2align 6\n\t"
        "1:\n\t"

#if UOP == 1 && BYTE == 1 && NOP == 1
        "nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
        "nopl   0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
        "xchg   %%ax,%%ax\n\t"
        "xchg   %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
        "nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
        "xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
        "xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
        "incl %k[v3]\n\t"
#endif

        "incl %k[v0]\n\t"
        "incl %k[v1]\n\t"
        "incl %k[v2]\n\t"

        "decl %k[loop_cnt]\n\t"
        "jnz 1b\n\t"
        "lfence\n\t"
        : [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
          [ v3 ] "=&r"(v3),
#endif
          [ loop_cnt ] "=&r"(loop_cnt)
        : [ N ] "i"(N)
        : "cc", "memory");
    end = _rdtsc();

    double dif = end - start;
    dif /= N;
    printf("UOP         -> %d\n", UOP);
    printf("BYTE        -> %d\n", BYTE);
    printf("NOP         -> %d\n", NOP);
    printf("BREAK_DEP   -> %d\n", BREAK_DEP);
    printf("COMPUTE_UOP -> %d\n", COMPUTE_UOP);
    printf("%.3lf \"Cycles\"\n", dif);
}


int
main(int argc, char ** argv) {
    bench();
}

Run Code Online (Sandbox Code Playgroud)

Results: You can see basically if its 1 uop that WONT be executed in the backend performance is ~.39 ref-cycles an iteration for a 5-uop loop (ICL front-end width). Otherwise with no NOP or xor-zeroing filler its ~.54 ref-cycles an iteration of a 4-uop loop:

UOP         -> 1
BYTE        -> 1
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801      idq_uops_not_delivered.cycles_fe_was_ok                                   
    5,840,894      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257      idq_uops_not_delivered.cycles_fe_was_ok                                   
    4,791,034      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 1
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,411,711      idq_uops_not_delivered.cycles_fe_was_ok                                   
    5,915,776      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 2
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,038,334      idq_uops_not_delivered.cycles_fe_was_ok                                   
  215,022,743      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 2
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,032,032,735      idq_uops_not_delivered.cycles_fe_was_ok                                   
  214,953,593      uops_issued.stall_cycles                                    
--------------------------------------------------------------

UOP         -> 4
BYTE        -> 4
NOP         -> 1
BREAK_DEP   -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924      idq_uops_not_delivered.cycles_fe_was_ok

归档时间：	4 年，9 月前
查看次数：	173 次
最近记录：	4 年，9 月前