Consider this simple C++ function to calculate the prefix sum of an array:
#include <cstddef>
#include <cstdint>

void prefix_sum(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    for (size_t i = 0; i < size; i++) {
        total += input[i];   // running total: 1-cycle loop-carried dependency
        output[i] = total;
    }
}
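The question doesn't include the benchmark harness, so purely as an illustrative assumption, a minimal driver over 8 KiB arrays (2048 uint32_t elements each) might look like the sketch below. It times with std::chrono wall-clock time, whereas the numbers in the question are core cycles, presumably read from hardware performance counters.

// Illustrative test driver -- an assumption for this write-up, not the
// harness the question's measurements came from (those are core cycles
// from performance counters, not wall-clock time).
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

void prefix_sum(const uint32_t* input, uint32_t* output, size_t size);

int main() {
    constexpr size_t n = 8 * 1024 / sizeof(uint32_t);   // 8 KiB per array
    std::vector<uint32_t> in(n, 1), out(n, 0);

    constexpr int reps = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++)
        prefix_sum(in.data(), out.data(), n);
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.3f ns/element (out[n-1] = %u)\n",
                ns / (double(reps) * n), (unsigned)out[n - 1]);
}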
The loop compiles to the following assembly on gcc 5.5:
.L5:
add ecx, DWORD PTR [rdi+rax*4]
mov DWORD PTR [rsi+rax*4], ecx
add rax, 1
cmp rdx, rax
jne .L5
I don't see anything that would prevent this from running at 1 cycle per iteration, yet I consistently measure it at 1.32 (+/- 0.01) cycles/iteration on my Skylake i7-6700HQ, when running it against 8 KiB input/output arrays.
The loop is served out of the uop cache, doesn't cross any uop-cache boundary, and performance counters don't indicate any front-end bottleneck.
It's 4 fused uops¹, and this CPU can sustain 4 fused uops/cycle.
There are loop-carried dependency chains through ecx and rax, each of 1 cycle, but these add uops can go to any of the 4 ALU ports, so they seem unlikely to conflict. The macro-fused cmp/jne needs to go to p6, which is more of a concern, but I measure only 1.1 uops/iteration to p6. That would explain 1.1 cycles per iteration, but not the 1.32 I measure. If I unroll the loop by 2x (a source-level sketch appears after the list below), port pressure is much lower: less than 0.7 uops to all of p0156, yet performance is still unexpectedly slow at 1.3 cycles per iteration.
There is one store per iteration, but we can do one store per cycle.
There is one load per iteration, but we can do two of those per cycle.
There are two complex (indexed) address generations per iteration, but we can do two of those per cycle.
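Regarding the 2x unroll mentioned above: the question doesn't show how it was done (most likely directly on the asm), so purely as a sketch under that caveat, a source-level equivalent could be:

// Sketch of a 2x-unrolled variant (an assumption, not the author's code):
// two elements per loop iteration, roughly halving the loop-overhead uops
// (add/cmp/jne) per element.
void prefix_sum_unroll2(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    size_t i = 0;
    for (; i + 1 < size; i += 2) {
        total += input[i];
        output[i] = total;
        total += input[i + 1];
        output[i + 1] = total;
    }
    if (i < size) {              // odd element count: finish the tail
        total += input[i];
        output[i] = total;
    }
}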
What's the bottleneck here?
Interestingly, I tried the Ithemal performance predictor and it gets it almost exactly right: estimating 1.314 cycles versus my measurement of 1.32.
¹ I confirmed macro- and micro-fusion via the uops_issued.any counter, which counts in the fused domain and reads 4.0 fused uops per iteration for this loop.
I just followed the instructions on the Ithemal performance predictor, and I may have found the issue. Trying
add ecx, DWORD PTR [rdi]
mov DWORD PTR [rsi], ecx
add rax, 1
cmp rdx, rax
gives an astonishing 1.131 cycles per iteration. Cross-checking by adding 0 in each iteration (which again gives 1.3 cycles) rules out a store/load bottleneck. This finally suggests a problem with the addressing modes.
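If the indexed addressing modes really are the difference, one way to nudge a compiler toward single-register addressing at the source level is to bump the pointers instead of indexing. This is a hedged sketch, not part of the original answer, and a compiler is still free to rewrite it back into indexed form:

// Pointer-increment variant (illustrative only): the load and store can
// use simple [reg] addressing, as in the 1.131-cycle experiment above,
// instead of the indexed [base+index*4] forms in the original loop.
void prefix_sum_ptr(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    for (const uint32_t* end = input + size; input != end; ++input, ++output) {
        total += *input;
        *output = total;
    }
}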
(Editor's note: this is interesting experimental data, matching what I posted in the thread on Agner Fog's blog, and which the guess below misinterprets. Simpler addressing modes speed it up even though there is no un-lamination.)
(Editor's note: this part is wrong: we know from the question that there is no un-lamination, because uops_issued.any = 4 per iteration.)
I think your CPU un-laminates your add/mov in the indexed-addressing case. This behavior is well documented for several microarchitectures (SnB, SKL, HWL), and someone on Stack Overflow did a great job describing the whole thing: /sf/answers/2171938681/
In short: if too many registers and flags are involved, the fused op (DSB) gets un-laminated (IDQ), effectively un-fusing it again.
Other resources:
Fusion limits: https://www.agner.org/optimize/blog/read.php?i=415#852
Un-lamination: https://easyperf.net/blog/2018/02/15/MicroFusion-in-Intel-CPUs#unlamination-example-1