当被除数为64位且商为32位时，如何使gcc或clang使用64位/32位除法而不是128位/64位除法？

Question

当被除数为64位且商为32位时，如何使gcc或clang使用64位/32位除法而不是128位/64位除法？

xiv*_*r77 5 c x86 gcc clang compiler-optimization

Currently, from research and various attempts, I'm pretty sure that the only solution to this problem is to use assembly. I'm posting this question to show an existing problem, and maybe get attention from compiler developers, or get some hits from searches about similar problems.

If anything changes in the future, I will accept it as an answer.

This is a very related question for MSVC.

In x86_64 machines, it is faster to use div/idiv with a 32-bit operand than a 64-bit operand. When the dividend is 64-bit and the divisor is 32-bit, and when you know that the quotient will fit in 32 bits, you don't have to use the 64-bit div/idiv. You can split the 64-bit dividend into two 32-bit registers, and even with this overhead, performing a 32-bit div on two 32-bit registers will be faster than doing a 64-bit div with a full 64-bit register.

The compiler will produce a 64-bit div with this function, and that is correct because for a 32-bit div, if the quotient of the division does not fit in 32 bits, an hardware exception occurs.

uint32_t div_c(uint64_t a, uint32_t b) {
    return a / b;
}

Run Code Online (Sandbox Code Playgroud)

However, if the quotient is known to be fit in 32 bits, doing a full 64-bit division is unnecessary. I used __builtin_unreachable to tell the compiler about this information, but it doesn't make a difference.

uint32_t div_c_ur(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}

Run Code Online (Sandbox Code Playgroud)

For both div_c and div_c_ur, the output from gcc is,

mov     rax, rdi
mov     esi, esi
xor     edx, edx
div     rsi
ret

Run Code Online (Sandbox Code Playgroud)

clang does an interesting optimization of checking the dividend size, but it still uses a 64-bit div when the dividend is 64-bit.

        mov     rax, rdi
        mov     ecx, esi
        mov     rdx, rdi
        shr     rdx, 32
        je      .LBB0_1
        xor     edx, edx
        div     rcx
        ret
.LBB0_1:
        xor     edx, edx
        div     ecx
        ret

Run Code Online (Sandbox Code Playgroud)

I had to write straight in assembly to achieve what I want. I couldn't find any other way to do this.

__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}

Run Code Online (Sandbox Code Playgroud)

Was it worth it? At least perf reports 49.47% overhead from div_c while 24.88% overhead from div_asm, so on my computer (Tiger Lake), div r32 is about 2 times faster than div r64.

This is the benchmark code.

#include <stdint.h>
#include <stdio.h>

__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}

__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}

static uint64_t rdtscp() {
    uint32_t _;
    return __builtin_ia32_rdtscp(&_);
}

int main() {
    #define n 500000000ll
    uint64_t c;

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_c(i + n * n, i + n);
    }
    printf("  c%15ul\n", rdtscp() - c);

    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_asm(i + n * n, i + n);
    }
    printf("asm%15ul\n", rdtscp() - c);
}

Run Code Online (Sandbox Code Playgroud)

Answer 1

xiv*_*r77 1

这个答案中的每个想法都基于 Nate Eldredge 的评论，从中我发现了gcc扩展内联汇编的一些强大功能。尽管我仍然需要编写汇编，但可以创建自定义的内部函数。

static inline uint32_t divqd(uint64_t a, uint32_t b) {
    if (__builtin_constant_p(b)) {
        return a / b;
    }
    uint32_t lo = a;
    uint32_t hi = a >> 32;
    __asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));    
    return lo;
}

Run Code Online (Sandbox Code Playgroud)

__builtin_constant_p返回1是否b可以在编译时评估。+aand表示从和寄存器 (和)+d读取和写入a值。指定输入可以是寄存器或内存操作数。deaxedxrmb

查看内联和常量传播是否顺利完成，

uint32_t divqd_r(uint64_t a, uint32_t b) {
    return divqd(a, b);
}

divqd_r:
        mov     rdx, rdi
        mov     rax, rdi
        shr     rdx, 32
        div esi
        ret

uint32_t divqd_m(uint64_t a) {
    extern uint32_t b;
    return divqd(a, b);
}

divqd_m:
        mov     rdx, rdi
        mov     rax, rdi
        shr     rdx, 32
        div DWORD PTR b[rip]
        ret

uint32_t divqd_c(uint64_t a) {
    return divqd(a, 12345);
}

divqd_c:
        movabs  rdx, 6120523590596543007
        mov     rax, rdi
        mul     rdx
        shr     rdx, 12
        mov     eax, edx
        ret

Run Code Online (Sandbox Code Playgroud)

结果令人满意（https://godbolt.org/z/47PE4ovMM）。

归档时间：	3 年，10 月前
查看次数：	1064 次
最近记录：	3 年，10 月前