Noa*_*oah 7 assembly x86-64 cpu-architecture micro-optimization
All benchmarks are done on: Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)
Edit: I was not able to reproduce this on broadwell and @PeterCordes was unable to reproduce it on skylake
I was trying to benchmark different methods of doing integer min(a, b) but ran into some unexplained behavior that I've boiled down to the following benchmark:
#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
#define SIX_BYTES_COMPUTATION 1
#define WITH_NOP_BEFORE_DECL 0
#define BREAK_DEPENDENCY 0
void BENCH_FUNC_ATTR
bench() {
uint64_t start, end;
const uint64_t N = 1000000;
start = _rdtsc();
uint64_t v0, v1, dst, loop_cnt;
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"movl $1, %k[v1]\n\t"
"movl %[N], %k[loop_cnt]\n\t"
".p2align 6\n\t"
"1:\n\t"
#if SIX_BYTES_COMPUTATION
"xorl %k[loop_cnt], %k[v0]\n\t"
"xorl %k[loop_cnt], %k[v1]\n\t"
"movl %k[v0], %k[dst]\n\t"
#else
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#endif
".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
"nop\n\t"
#endif
#if BREAK_DEPENDENCY
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#endif
// macro-fusion is NOT broken
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc", "memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf(
"SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
"BREAK_DEPENDENCY - [%s]\n\t",
SIX_BYTES_COMPUTATION ? "ON" : "OFF",
WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");
printf("%.3lf \"Cycles\"\n", dif);
}
Run Code Online (Sandbox Code Playgroud)
Turning on WITH_NOP_BEFORE_DECL so that there is a nop before the decl + jnz causes a measureable performance improvement when SIX_BYTES_COMPUTATION is turned on though causes measurable performance degradation when SIX_BYTES_COMPUTATION is turned off.
Here are the numbers:
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
2.080 "Cycles" <--- Just 6 nops
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
2.363 "Cycles" <--- Performance degradation from previous
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
2.185 "Cycles" <--- Computation then decl + jnz
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
1.945 "Cycles" <--- Performance improvement from previous
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
1.919 "Cycles" <--- Breaking dependencies has best performance
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
2.046 "Cycles" <--- nop hurts performance when breaking dependencies
Run Code Online (Sandbox Code Playgroud)
It might have to do with the register file filling up? I was able to find one potentially interesting metric uops_issued.stall_cycles [Cycles when RAT does not issue Uops to RS for the thread] which has the following outputs:
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
473,647 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
495,380 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
1,406,244 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
875,364 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
647,297 uops_issued.stall_cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [ON]
501,015 uops_issued.stall_cycles
Run Code Online (Sandbox Code Playgroud)
It does seem to correspond with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL on or off but I'm not sure 1) why a nop would save room in the register files.
I'm pretty sure its not an alignment issue because with the .p2align 4 between the first 6 bytes of the loop body and the decl + jnz the decl + jnz will be on a different 16 byte aligned region either way and it the performance difference is reversed depending on what is in the loop body (so if it where an alignment thing it shouldn't matter if loop body is nops or computation).
I was thinking it might to do with some dependency chain issue but because if I break the dependency on v0 and v1 at the end of the loop then WITH_NOP_BEFORE_DECL turned on causes a performance degradation. I am very likely wrong about that though because I have no idea why a nop before the end of the loop would affect any dependency issues.
It almost definitely does not have to do with port scheduling. I was thinking maybe something weird was going where the nop by chance lead to better scheduling but didn't get any different in uops on ports 1,2,5,6 with out or WITH_NOP_BEFORE_DECL on:
Instruction per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL Off:
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF]
1,147,196 uops_dispatched.port_0
1,114,665 uops_dispatched.port_1
1,138,238 uops_dispatched.port_5
1,266,212 uops_dispatched.port_6
Run Code Online (Sandbox Code Playgroud)
Instruction per port with SIX_BYTES_COMPUTATION on and WITH_NOP_BEFORE_DECL On:
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON]
1,177,092 uops_dispatched.port_0
1,081,734 uops_dispatched.port_1
1,103,314 uops_dispatched.port_5
1,296,546 uops_dispatched.port_6
Run Code Online (Sandbox Code Playgroud)
My leading theory is that there is some inefficiency in the register renaming process that is the performance bound without the nop and by luck the nop hides that issue but I am not at all confident in that.
Can anyone help me understand this behavior.
Edit: Full cpp code and new times w/ warmup and lfence before rdtsc.
New Code
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>
#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
#ifndef SIX_BYTES_COMPUTATION
#define SIX_BYTES_COMPUTATION 0
#endif
#ifndef WITH_NOP_BEFORE_DECL
#define WITH_NOP_BEFORE_DECL 0
#endif
#ifndef BREAK_DEPENDENCY
#define BREAK_DEPENDENCY 0
#endif
void BENCH_FUNC_ATTR
bench() {
uint64_t start, end;
const uint64_t N = (1UL << 24);
const uint64_t WARMUP_N = N << 3;
uint64_t v0, v1, dst, loop_cnt;
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"movl $1, %k[v1]\n\t"
"movl %[N], %k[loop_cnt]\n\t"
".p2align 6\n\t"
"1:\n\t"
"xorl %k[loop_cnt], %k[v0]\n\t"
"xorl %k[loop_cnt], %k[v1]\n\t"
"movl %k[v0], %k[dst]\n\t"
".p2align 4\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(WARMUP_N)
: "cc", "memory");
asm volatile("lfence\n\t" : : : "memory");
start = _rdtsc();
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"movl $1, %k[v1]\n\t"
"movl %[N], %k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if SIX_BYTES_COMPUTATION
"xorl %k[loop_cnt], %k[v0]\n\t"
"xorl %k[loop_cnt], %k[v1]\n\t"
"movl %k[v0], %k[dst]\n\t"
#else
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#endif
".p2align 4\n\t"
#if WITH_NOP_BEFORE_DECL
"nop\n\t"
#endif
#if BREAK_DEPENDENCY
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#endif
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ dst ] "=&r"(dst),
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc", "memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf(
"SIX_BYTES_COMPUTATION - [%s], WITH_NOP_BEFORE_DECL - [%s], "
"BREAK_DEPENDENCY - [%s]\n\t",
SIX_BYTES_COMPUTATION ? "ON" : "OFF",
WITH_NOP_BEFORE_DECL ? "ON" : "OFF", BREAK_DEPENDENCY ? "ON" : "OFF");
printf("%.3lf \"Cycles\"\n", dif);
}
int
main(int argc, char ** argv) {
bench();
}
Run Code Online (Sandbox Code Playgroud)
New Times
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.674 "Cycles"
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.799 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.747 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.650 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.727 "Cycles"
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
0.645 "Cycles"
Run Code Online (Sandbox Code Playgroud)
New times have the same trend as before just they are all alot faster.
Edit: Icelake perf numbers
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.681 "Cycles"
18,522,385,353 lsd.uops
1,038,665 idq.dsb_uops
4,270,402,172 cpu-cycles
SIX_BYTES_COMPUTATION - [OFF], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.778 "Cycles"
20,669,567,680 lsd.uops
1,049,193 idq.dsb_uops
4,807,261,565 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.734 "Cycles"
12,080,048,840 lsd.uops
1,035,128 idq.dsb_uops
4,552,666,461 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [ON], BREAK_DEPENDENCY - [OFF]
0.659 "Cycles"
14,232,154,418 lsd.uops
1,150,777 idq.dsb_uops
4,134,080,501 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [OFF]
0.735 "Cycles"
12,080,166,963 lsd.uops
982,311 idq.dsb_uops
4,553,457,015 cpu-cycles
SIX_BYTES_COMPUTATION - [ON], WITH_NOP_BEFORE_DECL - [OFF], BREAK_DEPENDENCY - [ON]
0.644 "Cycles"
16,374,872,770 lsd.uops
1,022,379 idq.dsb_uops
4,055,306,960 cpu-cycles
Run Code Online (Sandbox Code Playgroud)
Edit: I am certain its not dependency chain or byte related. Its adding one nop (non-backend uop) in certain places really helped performance. Hers a benchmark that demonstrates that really clearly I think.
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <x86intrin.h>
#include <type_traits>
#ifndef UOP
#define UOP 0
#endif
#ifndef BYTE
#define BYTE 0
#endif
#ifndef NOP
#define NOP 0
#endif
#ifndef BREAK_DEP
#define BREAK_DEP 0
#endif
#ifndef COMPUTE_UOP
#define COMPUTE_UOP 0
#endif
#if BREAK_DEP && NOP
#error "Either define NOP or BREAK_DEP"
#endif
#define BENCH_FUNC_ATTR __attribute__((aligned(64), noinline, noclone))
void BENCH_FUNC_ATTR
bench() {
uint64_t start, end;
const uint64_t N = (1UL << 31);
const uint64_t WARMUP_N = N >> 3;
register uint64_t v0 asm("rdi");
register uint64_t v1 asm("rsi");
register uint64_t v2 asm("rdx");
#if COMPUTE_UOP
register uint64_t v3 asm("rax");
#endif
register uint64_t loop_cnt asm("rcx");
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
"xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_OUP
"xorl %k[v3], %k[v3]\n\t"
#endif
"movl %[N], %k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if UOP == 1 && BYTE == 1 && NOP == 1
"nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
"xchg %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
"nopl 0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
"xchg %%ax,%%ax\n\t"
"xchg %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
"incl %k[v3]\n\t"
#endif
"incl %k[v0]\n\t"
"incl %k[v1]\n\t"
"incl %k[v2]\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
[ v3 ] "=&r"(v3),
#endif
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(WARMUP_N)
: "cc", "memory");
start = _rdtsc();
asm volatile(
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
"xorl %k[v2], %k[v2]\n\t"
#if COMPUTE_OUP
"xorl %k[v3], %k[v3]\n\t"
#endif
"movl %[N], %k[loop_cnt]\n\t"
"lfence\n\t"
".p2align 6\n\t"
"1:\n\t"
#if UOP == 1 && BYTE == 1 && NOP == 1
"nop\n\t"
#elif UOP == 1 && BYTE == 2 && NOP == 1
"xchg %%ax,%%ax\n\t"
#elif UOP == 1 && BYTE == 4 && NOP == 1
"nopl 0x0(%%rax)\n\t"
#elif UOP == 2 && BYTE == 2 && NOP == 1
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && NOP == 1
"xchg %%ax,%%ax\n\t"
"xchg %%ax,%%ax\n\t"
#elif UOP == 4 && BYTE == 4 && NOP == 1
"nop\n\t"
"nop\n\t"
"nop\n\t"
"nop\n\t"
#elif UOP == 2 && BYTE == 4 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
"xorl %k[v1], %k[v1]\n\t"
#elif UOP == 1 && BYTE == 2 && BREAK_DEP == 1
"xorl %k[v0], %k[v0]\n\t"
#elif COMPUTE_UOP
"incl %k[v3]\n\t"
#endif
"incl %k[v0]\n\t"
"incl %k[v1]\n\t"
"incl %k[v2]\n\t"
"decl %k[loop_cnt]\n\t"
"jnz 1b\n\t"
"lfence\n\t"
: [ v0 ] "=&r"(v0), [ v1 ] "=&r"(v1), [ v2 ] "=&r"(v2),
#if COMPUTE_UOP
[ v3 ] "=&r"(v3),
#endif
[ loop_cnt ] "=&r"(loop_cnt)
: [ N ] "i"(N)
: "cc", "memory");
end = _rdtsc();
double dif = end - start;
dif /= N;
printf("UOP -> %d\n", UOP);
printf("BYTE -> %d\n", BYTE);
printf("NOP -> %d\n", NOP);
printf("BREAK_DEP -> %d\n", BREAK_DEP);
printf("COMPUTE_UOP -> %d\n", COMPUTE_UOP);
printf("%.3lf \"Cycles\"\n", dif);
}
int
main(int argc, char ** argv) {
bench();
}
Run Code Online (Sandbox Code Playgroud)
Results: You can see basically if its 1 uop that WONT be executed in the backend performance is ~.39 ref-cycles an iteration for a 5-uop loop (ICL front-end width). Otherwise with no NOP or xor-zeroing filler its ~.54 ref-cycles an iteration of a 4-uop loop:
UOP -> 1
BYTE -> 1
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,617,801 idq_uops_not_delivered.cycles_fe_was_ok
5,840,894 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.389 "Cycles"
2,419,599,257 idq_uops_not_delivered.cycles_fe_was_ok
4,791,034 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 1
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.391 "Cycles"
2,420,411,711 idq_uops_not_delivered.cycles_fe_was_ok
5,915,776 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 2
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.554 "Cycles"
3,032,038,334 idq_uops_not_delivered.cycles_fe_was_ok
215,022,743 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 2
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.555 "Cycles"
3,032,032,735 idq_uops_not_delivered.cycles_fe_was_ok
214,953,593 uops_issued.stall_cycles
--------------------------------------------------------------
UOP -> 4
BYTE -> 4
NOP -> 1
BREAK_DEP -> 0
COMPUTE_UOP -> 0
0.683 "Cycles"
3,629,685,924 idq_uops_not_delivered.cycles_fe_was_ok