在具有240个或更多元素的数组上循环时,为什么会对性能产生较大影响?

Guy*_*and 214 arrays performance rust llvm-codegen

当在Rust中的一个数组上运行求和循环时,我发现CAPACITY> = 240 时性能会大幅下降。CAPACITY= 239的速度大约是80倍。

Rust对“短”数组进行了特殊的编译优化吗?

与编译rustc -C opt-level=3

use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }
    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }
    println!("sum:{} time:{:?}", sum, now.elapsed());
}
Run Code Online (Sandbox Code Playgroud)

Luk*_*odt 339

Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.



You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of usizes, therefore the length is multiplied with 8 bytes, assuming x86-64 CPU). In this benchmark, one specific optimization – only performed for length 239 – is responsible for the huge speed difference. But let's start slowly:

(All code in this answer is compiled with -C opt-level=3)

pub fn foo() -> usize {
    let arr = [0; 240];
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}
Run Code Online (Sandbox Code Playgroud)

This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change 240 to 239, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:

movdqa  xmm1, xmmword ptr [rsp + 32]
movdqa  xmm0, xmmword ptr [rsp + 48]
paddq   xmm1, xmmword ptr [rsp]
paddq   xmm0, xmmword ptr [rsp + 16]
paddq   xmm1, xmmword ptr [rsp + 64]
; more stuff omitted here ...
paddq   xmm0, xmmword ptr [rsp + 1840]
paddq   xmm1, xmmword ptr [rsp + 1856]
paddq   xmm0, xmmword ptr [rsp + 1872]
paddq   xmm0, xmm1
pshufd  xmm1, xmm0, 78
paddq   xmm1, xmm0
Run Code Online (Sandbox Code Playgroud)

This is what's called loop unrolling: LLVM pastes the loop body a bunch of time to avoid having to execute all those "loop management instructions", i.e. incrementing the loop variable, check if the loop has ended and the jump to the start of the loop.

In case you're wondering: the paddq and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (xmm0 and xmm1) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.

Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and paddq throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.

LLVM does unroll small loops some when it's not fully unrolling, and still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.


But loop unrolling is not responsible for a performance difference of factor 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:

const CAPACITY: usize = 239;
const IN_LOOPS: usize = 500000;

pub fn foo() -> usize {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    sum
}
Run Code Online (Sandbox Code Playgroud)

(On Godbolt Compiler Explorer)

The assembly for CAPACITY = 240 looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far so expected.

The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up sum a bunch of times!

First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I commented to explain the assembly; the comments with * are especially important):

        ; at the start of the function, `rbx` was set to 0

        movq    rax, xmm1     ; result of SIMD summing up stored in `rax`
        add     rax, 711      ; add up missing terms from loop unrolling
        mov     ecx, 500000   ; * init loop variable outer loop
.LBB0_1:
        add     rbx, rax      ; * rbx += rax
        add     rcx, -1       ; * decrement loop variable
        jne     .LBB0_1       ; * if loop variable != 0 jump to LBB0_1
        mov     rax, rbx      ; move rbx (the sum) back to rax
        ; two unimportant instructions omitted
        ret                   ; the return value is stored in `rax`
Run Code Online (Sandbox Code Playgroud)

As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have ran and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.

This means the runtime changes from CAPACITY * IN_LOOPS to CAPACITY + IN_LOOPS. And this is responsible for the huge performance difference.


An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds as without them LLVM-optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even factor 2 in these cases. So no need to worry about real use cases.

As a last note about idiomatic Rust code: arr.iter().sum() is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in emitted assembly. You should use short and idiomatic versions unless you measured that it hurts performance.

  • 为什么编译器或LLVM没有意识到可以在编译时进行整个计算?我本来希望对循环结果进行硬编码。还是使用“即时”可以防止这种情况? (7认同)
  • @LukasKalbertodt [LLVM中发生了其他事情](https://godbolt.org/z/fIbhYX)打开AVX2不会造成太大的改变。[也被复制为锈](https://rust.godbolt.org/z/QwZnUQ) (4认同)
  • @Mgetz有趣!但是,让该阈值取决于可用的SIMD指令对我来说听起来并不疯狂,因为这最终决定了完全展开循环中的指令数量。但不幸的是,我不能肯定地说。有一个LLVM开发人员回答这个问题会很甜蜜。 (4认同)
  • @JosephGarvin:我认为这是因为完全展开会允许稍后的优化传递看到这一点。请记住,优化的编译器仍然关心快速编译以及提高效率的asm,因此,他们必须限制所进行的任何分析的最坏情况下的复杂度,因此,无需花费数小时/天就可以编译出带有复杂循环的讨厌的源代码。但是,是的,这显然是错过了对大小> = 240的优化。我想知道是否不是为了避免破坏简单的基准测试而优化循环内部的循环?可能不是,但是也许。 (4认同)
  • @ lukas-kalbertodt感谢您的出色回答!现在我也明白了为什么直接在非本地`s`上更新`sum`的原始代码运行得慢得多。```对于0中的i..arr.len(){sum + = arr [i]; }``` (2认同)

mja*_*mja 28

除了Lukas的答案,如果您想使用迭代器,请尝试以下操作:

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

pub fn bar() -> usize {
    (0..CAPACITY).sum::<usize>() * IN_LOOPS
}
Run Code Online (Sandbox Code Playgroud)

感谢@Chris Morgan提供有关范围模式的建议。

优化的装配是相当不错的:

example::bar:
        movabs  rax, 14340000000
        ret
Run Code Online (Sandbox Code Playgroud)

  • 我实际上会解释说,程序集实际上并没有进行计算,但是LLVM在这种情况下已经预先计算了答案。 (9认同)
  • 或者更好的是,((0..CAPACITY).sum :: &lt;usize&gt;()* IN_LOOPS`,产生相同的结果。 (3认同)