xiv*_*r77 5 c x86 gcc clang compiler-optimization
Currently, from research and various attempts, I'm pretty sure that the only solution to this problem is to use assembly. I'm posting this question to show an existing problem, and maybe get attention from compiler developers, or get some hits from searches about similar problems.
If anything changes in the future, I will accept it as an answer.
This is a very related question for MSVC.
In x86_64 machines, it is faster to use div/idiv with a 32-bit operand than a 64-bit operand. When the dividend is 64-bit and the divisor is 32-bit, and when you know that the quotient will fit in 32 bits, you don't have to use the 64-bit div/idiv. You can split the 64-bit dividend into two 32-bit registers, and even with this overhead, performing a 32-bit div on two 32-bit registers will be faster than doing a 64-bit div with a full 64-bit register.
The compiler will produce a 64-bit div with this function, and that is correct because for a 32-bit div, if the quotient of the division does not fit in 32 bits, an hardware exception occurs.
uint32_t div_c(uint64_t a, uint32_t b) {
return a / b;
}
Run Code Online (Sandbox Code Playgroud)
However, if the quotient is known to be fit in 32 bits, doing a full 64-bit division is unnecessary. I used __builtin_unreachable to tell the compiler about this information, but it doesn't make a difference.
uint32_t div_c_ur(uint64_t a, uint32_t b) {
uint64_t q = a / b;
if (q >= 1ull << 32) __builtin_unreachable();
return q;
}
Run Code Online (Sandbox Code Playgroud)
For both div_c and div_c_ur, the output from gcc is,
mov rax, rdi
mov esi, esi
xor edx, edx
div rsi
ret
Run Code Online (Sandbox Code Playgroud)
clang does an interesting optimization of checking the dividend size, but it still uses a 64-bit div when the dividend is 64-bit.
mov rax, rdi
mov ecx, esi
mov rdx, rdi
shr rdx, 32
je .LBB0_1
xor edx, edx
div rcx
ret
.LBB0_1:
xor edx, edx
div ecx
ret
Run Code Online (Sandbox Code Playgroud)
I had to write straight in assembly to achieve what I want. I couldn't find any other way to do this.
__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
"mov eax, edi\n\t"
"mov rdx, rdi\n\t"
"shr rdx, 32\n\t"
"div esi\n\t"
"ret\n\t"
);}
Run Code Online (Sandbox Code Playgroud)
Was it worth it? At least perf reports 49.47% overhead from div_c while 24.88% overhead from div_asm, so on my computer (Tiger Lake), div r32 is about 2 times faster than div r64.
This is the benchmark code.
#include <stdint.h>
#include <stdio.h>
__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
uint64_t q = a / b;
if (q >= 1ull << 32) __builtin_unreachable();
return q;
}
__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
"mov eax, edi\n\t"
"mov rdx, rdi\n\t"
"shr rdx, 32\n\t"
"div esi\n\t"
"ret\n\t"
);}
static uint64_t rdtscp() {
uint32_t _;
return __builtin_ia32_rdtscp(&_);
}
int main() {
#define n 500000000ll
uint64_t c;
c = rdtscp();
for (int i = 1; i <= n; ++i) {
volatile uint32_t _ = div_c(i + n * n, i + n);
}
printf(" c%15ul\n", rdtscp() - c);
c = rdtscp();
for (int i = 1; i <= n; ++i) {
volatile uint32_t _ = div_asm(i + n * n, i + n);
}
printf("asm%15ul\n", rdtscp() - c);
}
Run Code Online (Sandbox Code Playgroud)
这个答案中的每个想法都基于 Nate Eldredge 的评论,从中我发现了gcc扩展内联汇编的一些强大功能。尽管我仍然需要编写汇编,但可以创建自定义的内部函数。
static inline uint32_t divqd(uint64_t a, uint32_t b) {
if (__builtin_constant_p(b)) {
return a / b;
}
uint32_t lo = a;
uint32_t hi = a >> 32;
__asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));
return lo;
}
Run Code Online (Sandbox Code Playgroud)
__builtin_constant_p返回1是否b可以在编译时评估。+aand表示从 和寄存器 (和)+d读取和写入a值。指定输入可以是寄存器或内存操作数。deaxedxrmb
查看内联和常量传播是否顺利完成,
uint32_t divqd_r(uint64_t a, uint32_t b) {
return divqd(a, b);
}
divqd_r:
mov rdx, rdi
mov rax, rdi
shr rdx, 32
div esi
ret
uint32_t divqd_m(uint64_t a) {
extern uint32_t b;
return divqd(a, b);
}
divqd_m:
mov rdx, rdi
mov rax, rdi
shr rdx, 32
div DWORD PTR b[rip]
ret
uint32_t divqd_c(uint64_t a) {
return divqd(a, 12345);
}
divqd_c:
movabs rdx, 6120523590596543007
mov rax, rdi
mul rdx
shr rdx, 12
mov eax, edx
ret
Run Code Online (Sandbox Code Playgroud)
结果令人满意(https://godbolt.org/z/47PE4ovMM)。
| 归档时间: |
|
| 查看次数: |
1064 次 |
| 最近记录: |