What optimizations does __builtin_unreachable facilitate?

Question

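From GCC's documentation:

If control flow reaches the point of the __builtin_unreachable, the program is undefined.

I thought __builtin_unreachable could be used as a hint to the optimizer in all sorts of creative ways, so I did a small experiment: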

#include <utility>  // std::swap

void stdswap(int& x, int& y)
{
    std::swap(x, y);
}

void brswap(int& x, int& y)
{
    if(&x == &y)
        __builtin_unreachable();
    x ^= y;
    y ^= x;
    x ^= y;
}

void rswap(int& __restrict x, int& __restrict y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

This compiles to (g++ -O2):

stdswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        mov     edx, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], edx
        mov     DWORD PTR [rsi], eax
        ret
brswap(int&, int&):
        mov     eax, DWORD PTR [rdi]
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], eax
        xor     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rsi], eax
        xor     DWORD PTR [rdi], eax
        ret
rswap(int&, int&):
        mov     eax, DWORD PTR [rsi]
        mov     edx, DWORD PTR [rdi]
        mov     DWORD PTR [rdi], eax
        mov     DWORD PTR [rsi], edx
        ret

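I think stdswap and rswap are optimal from the optimizer's point of view. Why doesn't brswap compile to the same thing? Can I get it to compile to the same thing using __builtin_unreachable?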

Answer 1

Sta*_*irl 10

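The purpose of __builtin_unreachable is to help the compiler remove dead code (code the programmer knows will never be executed) and to linearize the code by letting the compiler know that a path is "cold". Consider the following: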

#include <cstdio>  // std::puts

void exit_if_true(bool x);

int foo1(bool x)
{
    if (x) {
        exit_if_true(true);
        //__builtin_unreachable(); // we do not enable it here
    } else {
        std::puts("reachable");
    }

    return 0;
}
int foo2(bool x)
{
    if (x) {
        exit_if_true(true);
        __builtin_unreachable();  // now compiler knows exit_if_true
                                  // will not return as we are passing true to it
    } else {
        std::puts("reachable");
    }

    return 0;
}

Generated code:

foo1(bool):
        sub     rsp, 8
        test    dil, dil
        je      .L2              ; that jump is going to change
        mov     edi, 1
        call    exit_if_true(bool)
        xor     eax, eax         ; that tail is going to be removed
        add     rsp, 8
        ret
.L2:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
foo2(bool):
        sub     rsp, 8
        test    dil, dil
        jne     .L9              ; changed jump
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        xor     eax, eax
        add     rsp, 8
        ret
.L9:
        mov     edi, 1
        call    exit_if_true(bool)

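Notice the differences:

xor eax, eax and ret were removed after the call in foo2, because the compiler now knows that this is dead code.
The compiler swapped the order of the branches: the branch with the puts call now comes first, so the conditional jump can be faster (a branch that is not taken is faster, even when it is predicted).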

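The assumption here is that a branch ending in a noreturn function call or in __builtin_unreachable will either be executed only once or will lead to a longjmp call or a thrown exception. Both are rare and do not need to be prioritized during optimization.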
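The same information can often be given to the compiler through the callee itself rather than at each call site. The sketch below is an illustration only: `die` and `checked_div` are hypothetical names that do not appear in the answer. Because `die` is declared `[[noreturn]]`, the compiler already knows that control never continues past the call, so no `__builtin_unreachable()` is needed after it.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not from the answer): prints a message and never returns.
[[noreturn]] void die(const char* msg)
{
    std::fputs(msg, stderr);
    std::abort();
}

// The error branch ends in a noreturn call, so the compiler can treat it as
// a rarely executed path and keep the normal path as the fall-through.
int checked_div(int a, int b)
{
    if (b == 0)
        die("division by zero\n");  // control never comes back from here
    return a / b;
}

// Small usage check; only the non-failing path is exercised.
void demo_checked_div()
{
    assert(checked_div(6, 3) == 2);
    assert(checked_div(-9, 3) == -3);
}
```

When the callee cannot be annotated (as with exit_if_true above, which only exits for some arguments), __builtin_unreachable() after the call remains the way to state that fact.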

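You are trying to use it for a different purpose: giving the compiler information about aliasing (you could try the same for alignment). Unfortunately, GCC does not understand such address checks.

As you have noticed, adding __restrict__ helps. So __restrict__ works for aliasing; __builtin_unreachable does not.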
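To see why the compiler needs a no-alias guarantee before it may turn the XOR sequence into a plain load/store swap, compare the two on aliased arguments. This is a small illustrative sketch; `xor_swap` and `demo_aliasing` are names made up for this example.

```cpp
#include <cassert>
#include <utility>

// Same body as brswap in the question, without any aliasing hint.
void xor_swap(int& x, int& y)
{
    x ^= y;
    y ^= x;
    x ^= y;
}

void demo_aliasing()
{
    // Distinct objects: the XOR sequence behaves like a swap.
    int a = 5, b = 9;
    xor_swap(a, b);
    assert(a == 9 && b == 5);

    // Same object: the first "x ^= y" computes c ^ c, so the value is lost.
    int c = 7;
    xor_swap(c, c);
    assert(c == 0);

    // std::swap on the same object leaves the value intact.
    int d = 7;
    std::swap(d, d);
    assert(d == 7);
}
```

Because the two forms differ when both references name the same object, the optimizer can only replace one with the other once it knows the references are distinct, which is the promise __restrict__ makes.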

Look at the following example, which uses __builtin_assume_aligned:

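#include <cstdint>  // uintptr_t

void copy1(int *__restrict__ dst, const int *__restrict__ src)
{
    // Attempt to promise 16-byte alignment: the misaligned case is declared unreachable.
    if (reinterpret_cast<uintptr_t>(dst) % 16 != 0) __builtin_unreachable();
    if (reinterpret_cast<uintptr_t>(src) % 16 != 0) __builtin_unreachable();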

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

void copy2(int *__restrict__ dst, const int *__restrict__ src)
{
    dst = static_cast<int *>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int *>(__builtin_assume_aligned(src, 16));

    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

Generated code:

copy1(int*, int const*):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movups  XMMWORD PTR [rdi], xmm0
        ret
copy2(int*, int const*):
        movdqa  xmm0, XMMWORD PTR [rsi]
        movaps  XMMWORD PTR [rdi], xmm0
        ret

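You might assume that the compiler understands that dst % 16 == 0 means the pointer is 16-byte aligned, but it does not. So copy1 uses unaligned loads and stores, while the second version generates faster instructions that require an aligned address.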
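Since __builtin_assume_aligned is a promise and not a check, passing a misaligned pointer is undefined behaviour. One way to guard the promise in debug builds is sketched below; `copy4_aligned` and `demo_copy` are hypothetical names for this illustration, and the callers use `alignas(16)` so that the promise actually holds.

```cpp
#include <cassert>
#include <cstdint>

void copy4_aligned(int* __restrict__ dst, const int* __restrict__ src)
{
    // Debug-build check of the promise made below (compiled out with NDEBUG).
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    // Tell the compiler both pointers are 16-byte aligned.
    dst = static_cast<int*>(__builtin_assume_aligned(dst, 16));
    src = static_cast<const int*>(__builtin_assume_aligned(src, 16));

    for (int i = 0; i < 4; ++i)
        dst[i] = src[i];
}

void demo_copy()
{
    // alignas(16) guarantees the alignment that copy4_aligned assumes.
    alignas(16) int src[4] = {1, 2, 3, 4};
    alignas(16) int dst[4] = {};

    copy4_aligned(dst, src);

    for (int i = 0; i < 4; ++i)
        assert(dst[i] == src[i]);
}
```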

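Unreachable is more than just cold. "Cold" would mean rarely executed, especially only at startup or cleanup. You can tell GCC that with `__builtin_expect` on a branch condition ([likely/unlikely macros](/sf/ask/7679731/ kernel-work-and-what-is-their-ben/31133787#31133787)), or with [`__attribute__((cold))`](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes) on a function, or with profile-guided optimization (PGO). Nice answer other than that nitpick on terminology. (4 upvotes)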
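The hints named in this comment could look roughly like the following. This is a sketch, not code from the thread: the `LIKELY`/`UNLIKELY` macro names, `report_error`, and `process` are assumptions made up for illustration. Unlike __builtin_unreachable, these mark a path as rare while it stays fully valid to execute.

```cpp
#include <cassert>
#include <cstdio>

// Conventional wrappers around __builtin_expect (macro names are an assumption).
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical error reporter, marked as rarely called.
__attribute__((cold)) void report_error(int code)
{
    std::fprintf(stderr, "error %d\n", code);
}

// The negative-input path is hinted as unlikely, but it is still reachable
// and must behave correctly when taken.
int process(int value)
{
    if (UNLIKELY(value < 0)) {
        report_error(value);
        return -1;
    }
    return value * 2;
}
```

Calling process(21) takes the hinted fast path, and process(-5) still works correctly through the cold path and returns -1.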

Archived:	6 years, 9 months ago
Views:	489
Last activity:	6 years, 3 months ago