Why does clang's epilogue use `add $N, %rsp` instead of `mov %rbp, %rsp` to restore `%rsp`?

Question

Amm*_*izi 3 assembly x86-64 clang micro-optimization

Consider the following:

ammarfaizi2@integral:/tmp$ vi test.c
ammarfaizi2@integral:/tmp$ cat test.c

extern void use_buffer(void *buf);

void a_func(void)
{
    char buffer[4096];
    use_buffer(buffer);
}

__asm__("emit_mov_rbp_to_rsp:\n\tmovq %rbp, %rsp");

ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -O3 -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <emit_mov_rbp_to_rsp>:
   0: 48 89 ec              mov    %rbp,%rsp
   3: 66 2e 0f 1f 84 00 00  cs nopw 0x0(%rax,%rax,1)
   a: 00 00 00 
   d: 0f 1f 00              nopl   (%rax)

0000000000000010 <a_func>:
  10: 55                    push   %rbp
  11: 48 89 e5              mov    %rsp,%rbp
  14: 48 81 ec 00 10 00 00  sub    $0x1000,%rsp
  1b: 48 8d bd 00 f0 ff ff  lea    -0x1000(%rbp),%rdi
  22: e8 00 00 00 00        call   27 <a_func+0x17>
  27: 48 81 c4 00 10 00 00  add    $0x1000,%rsp
  2e: 5d                    pop    %rbp
  2f: c3                    ret    
ammarfaizi2@integral:/tmp$


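At the end of `a_func()`, just before it returns, the function epilogue restores `%rsp`. It uses `add $0x1000, %rsp`, which encodes as the 7 bytes `48 81 c4 00 10 00 00`.

Couldn't it just use `mov %rbp, %rsp`, which is only 3 bytes (`48 89 ec`)?

Why doesn't clang use the shorter form (`mov %rbp, %rsp`)?

Given the code-size cost, what is the advantage of using `add $0x1000, %rsp` instead of `mov %rbp, %rsp`?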

Update (extra)

Even with `-Os`, it still generates the same code. So I think there must be a good reason to avoid `mov %rbp, %rsp`.

ammarfaizi2@integral:/tmp$ clang -Wall -Wextra -c -Os -fno-omit-frame-pointer test.c -o test.o
ammarfaizi2@integral:/tmp$ objdump -d test.o

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <emit_mov_rbp_to_rsp>:
   0:   48 89 ec                mov    %rbp,%rsp

0000000000000003 <a_func>:
   3:   55                      push   %rbp
   4:   48 89 e5                mov    %rsp,%rbp
   7:   48 81 ec 00 10 00 00    sub    $0x1000,%rsp
   e:   48 8d bd 00 f0 ff ff    lea    -0x1000(%rbp),%rdi
  15:   e8 00 00 00 00          call   1a <a_func+0x17>
  1a:   48 81 c4 00 10 00 00    add    $0x1000,%rsp
  21:   5d                      pop    %rbp
  22:   c3                      ret    
ammarfaizi2@integral:/tmp$


Answer 1

Pet*_*des 6

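If it's using RBP as a frame pointer at all, then yes, `mov %rbp, %rsp` would be more compact and at least as fast on all x86 microarchitectures. (mov-elimination might even apply to it.) That is even more true when the `add` constant doesn't fit in an imm8.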
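To make the imm8 point concrete, here is a minimal sketch of how `add $imm, %rsp` is encoded, using the standard `REX.W 83 /0 ib` and `REX.W 81 /0 id` forms. The names `encode_add_rsp`, `mov_rbp_rsp`, and `check_against_question` are made up for this illustration; they do not come from clang or any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode `add $imm, %rsp` (AT&T syntax) into out[] and return its length.
 * out must have room for at least 7 bytes. Illustrative helper only. */
size_t encode_add_rsp(int32_t imm, uint8_t *out)
{
    size_t n = 0;
    out[n++] = 0x48;                  /* REX.W: 64-bit operand size */
    if (imm >= -128 && imm <= 127) {
        out[n++] = 0x83;              /* ADD r/m64, imm8 (sign-extended) */
        out[n++] = 0xc4;              /* ModRM: mod=11, reg=/0, rm=RSP */
        out[n++] = (uint8_t)imm;
    } else {
        out[n++] = 0x81;              /* ADD r/m64, imm32 (sign-extended) */
        out[n++] = 0xc4;
        for (int i = 0; i < 4; i++)   /* little-endian immediate */
            out[n++] = (uint8_t)((uint32_t)imm >> (8 * i));
    }
    return n;
}

/* `mov %rbp, %rsp` is always these 3 bytes, whatever the frame size. */
const uint8_t mov_rbp_rsp[] = { 0x48, 0x89, 0xec };

/* Sanity check against the bytes objdump printed in the question. */
void check_against_question(void)
{
    static const uint8_t expected[] = { 0x48, 0x81, 0xc4, 0x00, 0x10, 0x00, 0x00 };
    uint8_t buf[7];
    size_t n = encode_add_rsp(0x1000, buf);
    assert(n == sizeof expected);
    assert(memcmp(buf, expected, n) == 0);
}
```

So a small frame (constant up to 127) costs 4 bytes with `add`, only 1 byte more than `mov %rbp, %rsp`, while a frame like the question's 4096-byte buffer costs 7 bytes, 4 more than the `mov`.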

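This is probably a missed optimization, very similar to https://bugs.llvm.org/show_bug.cgi?id=10319 (which proposes using `leave` instead of mov/pop; that would cost 1 extra uop on Intel but save another 3 bytes). The bug report points out that the overall static code-size saving is fairly small in the normal case, but it doesn't consider any efficiency benefit. In a normal build (`-O2` without `-fno-omit-frame-pointer`), only a few functions use a frame pointer at all (only those that use VLAs / alloca or over-align the stack), so the possible benefit is even smaller.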
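As a rough size comparison of the three epilogues discussed here, the sketch below just writes out their machine-code bytes for the question's `a_func` (frame size 0x1000). The array names are invented for this illustration; the byte values are the standard encodings, and the first sequence matches the objdump output in the question.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* What clang emitted: add $0x1000,%rsp ; pop %rbp ; ret */
const uint8_t epilogue_add_pop[] = {
    0x48, 0x81, 0xc4, 0x00, 0x10, 0x00, 0x00,   /* add $0x1000,%rsp */
    0x5d,                                       /* pop %rbp */
    0xc3                                        /* ret */
};

/* The alternative asked about: mov %rbp,%rsp ; pop %rbp ; ret */
const uint8_t epilogue_mov_pop[] = {
    0x48, 0x89, 0xec,                           /* mov %rbp,%rsp */
    0x5d,                                       /* pop %rbp */
    0xc3                                        /* ret */
};

/* The LLVM bug's suggestion: leave ; ret
 * (leave behaves like mov %rbp,%rsp followed by pop %rbp) */
const uint8_t epilogue_leave[] = {
    0xc9,                                       /* leave */
    0xc3                                        /* ret */
};

static_assert(sizeof epilogue_add_pop == 9, "add/pop/ret is 9 bytes");
static_assert(sizeof epilogue_mov_pop == 5, "mov/pop/ret is 5 bytes");
static_assert(sizeof epilogue_leave == 2, "leave/ret is 2 bytes");
```

For this function, going from add/pop to mov/pop saves 4 bytes, and going from mov/pop to `leave` saves the further 3 bytes mentioned above.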

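From that bug, it looks like this is just a peephole that LLVM doesn't bother looking for, because many functions also need to restore other registers, so you actually need to `add` some other value to point RSP below the other pushes.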
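A hedged illustration of that case: in the variant below (`b_func` is a hypothetical extension of the question's `a_func`, and this `use_buffer` body is a stand-in added only so the file is self-contained), a value has to survive the call, so a compiler will typically keep it in a call-preserved register such as `%rbx` and push that register right after setting up `%rbp`. The exact registers and instruction order depend on the compiler and version, so treat this as something to compile and inspect rather than a guaranteed result.

```c
#include <string.h>

/* Stand-in for the question's use_buffer(); noinline so the call stays a
 * real call and the caller has to keep its live values somewhere safe. */
__attribute__((noinline)) void use_buffer(void *buf)
{
    memset(buf, 'x', 16);
    ((char *)buf)[16] = '\0';
}

/* a + b (or a and b separately) must survive the call, so the prologue
 * typically pushes one or more call-preserved registers below the saved
 * %rbp before reserving space for the buffer. */
long b_func(long a, long b)
{
    char buffer[4096];
    use_buffer(buffer);
    return a + b + (long)strlen(buffer);
}
```

Compiling it the same way as in the question (`clang -O3 -fno-omit-frame-pointer -c`) and running `objdump -d` should show extra `pop`s before `pop %rbp`. With that layout, a plain `mov %rbp, %rsp` would skip over the saved registers; the epilogue needs RSP to point at the lowest pushed register first, which is what `add $N, %rsp` (or something like `lea -N(%rbp), %rsp`) does.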

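(GCC sometimes uses `mov` loads to restore call-preserved registers so that it can use `leave`. With a frame pointer, the addressing modes are fairly compact to encode, although a 4-byte qword `mov -8(%rbp), %r12` is of course still not as small as a 2-byte `pop`. And if we don't have a frame pointer (e.g. in `-O2` code), `mov %rbp, %rsp` is never an option.)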

Before I considered the "not worth looking for" explanation, I thought of another small benefit:

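After a call to a function that saves/restores RBP, RBP is the result of a load. So after `mov %rbp, %rsp`, any later use of RSP has to wait for that load. Some corner cases could plausibly end up bottlenecked on store-forwarding latency, whereas a register modification like `add` has only 1 cycle of latency.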

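But overall, that doesn't seem likely to be worth the extra code size; I'd expect such corner cases to be rare. Still, `pop %rbp` needs the new RSP value, so after we return, the caller's restored RBP is the result of a chain of two loads. (Fortunately, `ret` has branch prediction to hide the latency.)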

So it might be worth trying both ways on some benchmarks, e.g. comparing a tweaked version of LLVM against the stock one on a standard benchmark like SPECint.

Thanks, it looks like we have a duplicate of https://bugs.llvm.org/show_bug.cgi?id=10319 (2 upvotes)
