I have the following code:
#include <iostream>
#include <chrono>
#define ITERATIONS "10000"
int main()
{
/*
======================================
The first case: the MOV is outside the loop.
======================================
*/
auto t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $200, %ebx\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time1:\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time1\n");
auto t2 = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << time;
/*
======================================
The second case: the MOV is inside the loop (faster).
======================================
*/
t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time2:\n"
" mov $200, %ebx\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time2\n");
t2 = std::chrono::high_resolution_clock::now();
time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << '\n' << time << '\n';
}
I compile it with
gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
and the output is
14474
5837
I also compiled it with Clang, with the same result.
So why is the second case faster (an almost 3x speedup)? Is it related to some microarchitectural detail? In case it matters, my CPU is an AMD "AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G".
The `mov $200, %ebx` inside the loop breaks the loop-carried dependency chain through `ebx`, allowing out-of-order execution to overlap the 5-instruction chain of `add`s across multiple iterations.
Without it, the chain of `add` instructions bottlenecks the loop on the latency of the `add` critical path (1 cycle per `add`) instead of on throughput (4 `add`s per cycle on Excavator, improved from 2 per cycle on Steamroller). Your CPU is an Excavator core.
AMD since Bulldozer has an efficient `loop` instruction (only 1 uop), unlike Intel CPUs, where `loop` would bottleneck either loop at 1 iteration per 7 cycles. (See https://agner.org/optimize/ for instruction tables, the microarch guide, and more details on everything in this answer.)
With `loop` and `mov` taking slots in the front-end (and back-end execution units) away from `add`, a 3x instead of 4x speedup looks about right.
See this answer for an intro to how CPUs find and exploit Instruction Level Parallelism (ILP).
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for some in-depth details about overlapping independent dep chains.
BTW, 10k iterations is not many. Your CPU might not even ramp up out of idle speed in that time. Or might jump to max speed for most of the 2nd loop but none of the first. So be careful with microbenchmarks like this.
Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX. You step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did that.