BeeOnRope  25   c · optimization · x86 · bit-manipulation · division
I want to implement unsigned[a] integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)[0]. In C, a strawman implementation that doesn't take advantage of the restricted domain of q could look something like the following function[1]:
/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
uint64_t res = p / q;
return p % q == 0 ? res : res + 1;
}
...but of course I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operations - taking advantage of the fact that q is a power of 2.
You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power-of-two.
Undefined behavior is fine if q is zero.
Please note that the simple solution: (p+q-1) >> lg(q) doesn't work in general - try it with p == 2^64-100 and q == 256[2], for example.
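A quick standalone check of that failure mode (my own illustration, not part of the original code):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t p = UINT64_MAX - 99;             /* 2^64 - 100 */
    uint64_t q = 256;
    uint64_t naive   = (p + q - 1) >> 8;      /* p + 255 wraps around, giving 0 */
    uint64_t correct = p / q + (p % q != 0);  /* ceil(p/q) = 2^56 */
    printf("naive=%llu correct=%llu\n",
           (unsigned long long)naive, (unsigned long long)correct);
    return 0;
}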
I'm interested in solutions in C that are likely to perform well across a variety of platforms, but for the sake of concreteness, for awarding the bounty, and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:
gcc 5.4.0 with compile flags -O3 -march=haswell

Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:
inline uint64_t lg(uint64_t x) {
return 63U - (uint64_t)__builtin_clzl(x);
}
inline uint32_t lg32(uint32_t x) {
return 31U - (uint32_t)__builtin_clz(x);
}
I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.
I wrote a benchmark for the existing answers, and will share and update the results as changes are made.
Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop[3].
You can avoid the whole inlining issue by making sure the code isn't inlined: declare it in another compilation unit. I tried that with a bench binary, but the results were fairly pointless. Nearly all of the implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0 took the same time. So you're mostly just measuring call + ret overhead. Furthermore, you are almost never really going to use functions like this - unless you've messed up, they will be available for inlining, and that changes everything.
So the two benchmarks I will focus on repeatedly call the method under test in a loop, allowing inlining, cross-function optimization, loop hoisting and even vectorization.
There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide depends on the previous call, so in general the calls cannot easily be overlapped[4]:
uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
uint32_t total = p;
for (unsigned i=0; i < ITERS; i++) {
total += divide_algo(total, q);
q = rotl1(q);
}
return total;
}
Note that the running total depends on the output of each divide call, and that it is also the input to the divide call.
The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows the work from one call to overlap with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:
uint32_t bench_divide_throughput(uint32_t p, uint32_t q) {
uint32_t total = p;
for (unsigned i=0; i < ITERS; i++) {
total += divide_algo(i, q);
q = rotl1(q);
}
return total;
}
Note that here we feed in the loop counter as the dividend - this is variable, but it doesn't depend on the previous divide call.
Furthermore, each benchmark has three flavors of behavior for the divisor, q:
Compile-time constant divisor: q is known at compile time, e.g., the call is divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time (a rough sketch of this flavor follows below).
Invariant divisor: q is not known at compile time, but is constant for the whole benchmarking loop, so work that depends only on q (such as lg(q)) can be hoisted out of the loop.
Variable divisor: q changes on every iteration (via rotl1 in the loops above), so nothing that depends on q can be hoisted.

Combining everything you get a total of 6 distinct benchmarks.
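For instance, here is a rough sketch of the compile-time-constant flavor (my own illustration, reusing the ITERS and divide_algo scaffolding from the loops above; the actual benchmark code may differ):

uint32_t bench_divide_constq(uint32_t p) {
    uint32_t total = p;
    for (unsigned i = 0; i < ITERS; i++) {
        total += divide_algo(total, 8);   /* divisor is a compile-time constant */
    }
    return total;
}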
For the purposes of picking an overall best algorithm, I looked at each of the 12 subtests for the proposed algorithms: (latency, throughput) x (constant q, invariant q, variable q) x (32-bit, 64-bit), and assigned a score of 2, 1, or 0 per subtest, roughly: 2 if the algorithm was the fastest (or close enough to the fastest), 1 if it was reasonably close but clearly behind, and 0 if it wasn't even close.
Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:
Algorithm                   Score
divide_user23_variant          20
divide_chux                    20
divide_user23                  15
divide_peter                   14
divide_chrisdodd               12
stoke32                        11
divide_chris                    0
divide_weather                  0
So for the purposes of this specific test code, with this specific compiler and on this platform, user2357112's "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact identical code).
Here are all the sub-scores which sum to the above:
Algorithm                   Total   LC   LI   LV   TC   TI   TV
divide_peter                    6    1    1    1    1    1    1
stoke32                         6    1    1    2    0    0    2
divide_chux                    10    2    2    2    1    2    1
divide_user23                   8    1    1    2    2    1    1
divide_user23_variant          10    2    2    2    1    2    1
divide_chrisdodd                6    1    1    2    0    0    2
divide_chris                    0    0    0    0    0    0    0
divide_weather                  0    0    0    0    0    0    0

64-bit Algorithm
divide_peter_64                 8    1    1    1    2    2    1
div_stoke_64                    5    1    1    2    0    0    1
divide_chux_64                 10    2    2    2    1    2    1
divide_user23_64                7    1    1    2    1    1    1
divide_user23_variant_64       10    2    2    2    1    2    1
divide_chrisdodd_64             6    1    1    2    0    0    2
divide_chris_64                 0    0    0    0    0    0    0
divide_weather_64               0    0    0    0    0    0    0
Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".
At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The gap is stark: all of the fast algorithms were the fastest on at least two subtests, and in general when they didn't finish first they fell into the "close enough" category (the only exceptions being failed vectorizations in the case of stoke and chrisdodd). The slow algorithms, however, scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.
Among the fast algorithms, a large differentiator was the ability to auto-vectorize.
None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.
For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q cases. In both of these tests the divisor q is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predictable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).
In this scenario, gcc was able to vectorize peter, stoke, chux, user23 and user23_variant. It wasn't able to vectorize chrisdodd for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.
Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether a function vectorized or not:
Size Vec? Name
045 N bench_c_div_stoke_64
049 N bench_c_divide_chrisdodd_64
059 N bench_c_stoke32_64
212 Y bench_c_divide_chux_64
227 Y bench_c_divide_peter_64
220 Y bench_c_divide_user23_64
212 Y bench_c_divide_user23_variant_64
The trend is clear - vectorized functions take about 4x the size of the non-vectorized ones. This is both because the core loops themselves are larger (vector instructions tend to be larger and there are more of them), and because loop setup and especially the post-loop code is much larger: for example, the vectorized version requires a reduction to sum all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated - but if it were variable, the generated code would be even larger.
Furthermore, despite the large improvement in runtime, gcc's vectorization is actually poor. Here's an excerpt from the vectorized version of Peter's routine:
on entry: ymm4 == all zeros
on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
4007a4: c5 ed 76 c4 vpcmpeqd ymm0,ymm2,ymm4
4007ad: c5 fd df c5 vpandn ymm0,ymm0,ymm5
4007b1: c5 dd fa c0 vpsubd ymm0,ymm4,ymm0
4007b5: c5 f5 db c0 vpand ymm0,ymm1,ymm0
This chunk works independently on 8 DWORD elements originating in ymm2. If we take x to be a single DWORD element of ymm2, and y the incoming value of ymm1, these four instructions correspond to:
                     x == 0    x != 0
x = x ? 0 : -1;  //    -1         0
x = x & 1;       //     1         0
x = 0 - x;       //    -1         0
x = y1 & x;      //    y1         0
So the first three instructions could simply be replaced by the first one, since the states are identical in either case. So that's two cycles added to that dependency chain (which isn't loop carried) and two extra uops. Evidently gcc's optimization phases somehow interact poorly with the vectorization code here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.
Nearly all of the solutions compiled to branch-free code, even when the C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use a conditional move or some variant. One exception is chrisdodd, which compiled with a branch (checking if p == 0) in all the throughput tests, but none of the latency ones. Here's a typical example from the constant q throughput test:
0000000000400e60 <bench_c_divide_chrisdodd_32>:
400e60: 89 f8 mov eax,edi
400e62: ba 01 00 00 00 mov edx,0x1
400e67: eb 0a jmp 400e73 <bench_c_divide_chrisdodd_32+0x13>
400e69: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
400e70: 83 c2 01 add edx,0x1
400e73: 83 fa 01 cmp edx,0x1
400e76: 74 f8 je 400e70 <bench_c_divide_chrisdodd_32+0x10>
400e78: 8d 4a fe lea ecx,[rdx-0x2]
400e7b: c1 e9 03 shr ecx,0x3
400e7e: 8d 44 08 01 lea eax,[rax+rcx*1+0x1]
400e82: 81 fa 00 ca 9a 3b cmp edx,0x3b9aca00
400e88: 75 e6 jne 400e70 <bench_c_divide_chrisdodd_32+0x10>
400e8a: c3 ret
400e8b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
The branch at 400e76 skips the case that p == 0. In fact, the compiler could have just peeled the first iteration out (calculating its result explicitly) and then avoided the jump entirely since after that it can prove that p != 0. In these tests, the branch is perfectly predictable, which could give an advantage to code that actually compiles using a branch (since the compare & branch code is essentially out of line and close to free), and is a big part of why chrisdodd wins the throughput, variable q case.
Here you can find some detailed test results and some details on the tests themselves.
The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like 4.01 is the same as 4.00, but the larger deviations like 5.11 seem to be real and reproducible.
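For example, 1.55 ns/call at the roughly 2.59 GHz clock implied by these numbers works out to the 4.01 cycles shown in the tables.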
The results for divide_plusq_32 use (p + q - 1) >> lg(q) but are only shown for reference, since this function fails for large p + q. The results for dummy are for a very simple function: return p + q; it lets you estimate the benchmark overhead[5] (the addition itself should take a cycle at most).
==============================
Bench: Compile-time constant Q
==============================
Function ns/call cycles
divide_peter_32 2.19 5.67
divide_peter_64 2.18 5.64
stoke32_32 1.93 5.00
stoke32_64 1.97 5.09
stoke_mul_32 2.75 7.13
stoke_mul_64 2.34 6.06
div_stoke_32 1.94 5.03
div_stoke_64 1.94 5.03
divide_chux_32 1.55 4.01
divide_chux_64 1.55 4.01
divide_user23_32 1.97 5.11
divide_user23_64 1.93 5.00
divide_user23_variant_32 1.55 4.01
divide_user23_variant_64 1.55 4.01
divide_chrisdodd_32 1.95 5.04
divide_chrisdodd_64 1.93 5.00
divide_chris_32 4.63 11.99
divide_chris_64 4.52 11.72
divide_weather_32 2.72 7.04
divide_weather_64 2.78 7.20
divide_plusq_32 1.16 3.00
divide_plusq_64 1.16 3.00
divide_dummy_32 1.16 3.00
divide_dummy_64 1.16 3.00
==============================
Bench: Invariant Q
==============================
Function ns/call cycles
divide_peter_32 2.19 5.67
divide_peter_64 2.18 5.65
stoke32_32 1.93 5.00
stoke32_64 1.93 5.00
stoke_mul_32 2.73 7.08
stoke_mul_64 2.34 6.06
div_stoke_32 1.93 5.00
div_stoke_64 1.93 5.00
divide_chux_32 1.55 4.02
divide_chux_64 1.55 4.02
divide_user23_32 1.95 5.05
divide_user23_64 2.00 5.17
divide_user23_variant_32 1.55 4.02
divide_user23_variant_64 1.55 4.02
divide_chrisdodd_32 1.95 5.04
divide_chrisdodd_64 1.93 4.99
divide_chris_32 4.60 11.91
divide_chris_64 4.58 11.85
divide_weather_32 12.54 32.49
divide_weather_64 17.51 45.35
divide_plusq_32 1.16 3.00
divide_plusq_64 1.16 3.00
divide_dummy_32 0.39 1.00
divide_dummy_64 0.39 1.00
==============================
Bench: Variable Q
==============================
Function ns/call cycles
divide_peter_32 2.31 5.98
divide_peter_64 2.26 5.86
stoke32_32 2.06 5.33
stoke32_64 1.99 5.16
stoke_mul_32 2.73 7.06
stoke_mul_64 2.32 6.00
div_stoke_32 2.00 5.19
div_stoke_64 2.00 5.19
divide_chux_32 2.04 5.28
divide_chux_64 2.05 5.30
divide_user23_32 2.05 5.30
divide_user23_64 2.06 5.33
divide_user23_variant_32 2.04 5.29
divide_user23_variant_64 2.05 5.30
divide_chrisdodd_32 2.04 5.30
divide_chrisdodd_64 2.05 5.31
divide_chris_32 4.65 12.04
divide_chris_64 4.64 12.01
divide_weather_32 12.46 32.28
divide_weather_64 19.46 50.40
divide_plusq_32 1.93 5.00
divide_plusq_64 1.99 5.16
divide_dummy_32 0.40 1.05
divide_dummy_64 0.40 1.04
Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so the performance is relatively very good for those: a fraction of a cycle in many cases. One result is that unlike most of the latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger than I would have expected).
==============================
Bench: Compile-time constant Q
==============================
Function ns/call cycles
stoke32_32 0.39 1.00
divide_chux_32 0.15 0.39
divide_chux_64 0.53 1.37
divide_user23_32 0.14 0.36
divide_user23_64 0.53 1.37
divide_user23_variant_32 0.15 0.39
divide_user23_variant_64 0.53 1.37
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.34 11.23
divide_chris_64 4.34 11.24
divide_weather_32 1.35 3.50
divide_weather_64 1.35 3.50
divide_plusq_32 0.10 0.26
divide_plusq_64 0.39 1.00
divide_dummy_32 0.08 0.20
divide_dummy_64 0.39 1.00
==============================
Bench: Invariant Q
==============================
Function ns/call cycles
stoke32_32 0.48 1.25
divide_chux_32 0.15 0.39
divide_chux_64 0.48 1.25
divide_user23_32 0.17 0.43
divide_user23_64 0.58 1.50
divide_user23_variant_32 0.15 0.38
divide_user23_variant_64 0.48 1.25
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.35 11.26
divide_chris_64 4.36 11.28
divide_weather_32 5.79 14.99
divide_weather_64 17.00 44.02
divide_plusq_32 0.12 0.31
divide_plusq_64 0.48 1.25
divide_dummy_32 0.09 0.23
divide_dummy_64 0.09 0.23
==============================
Bench: Variable Q
==============================
Function ns/call cycles
stoke32_32 1.16 3.00
divide_chux_32 1.36 3.51
divide_chux_64 1.35 3.50
divide_user23_32 1.54 4.00
divide_user23_64 1.54 4.00
divide_user23_variant_32 1.36 3.51
divide_user23_variant_64 1.55 4.01
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.02 10.41
divide_chris_64 3.84 9.95
divide_weather_32 5.40 13.98
divide_weather_64 19.04 49.30
divide_plusq_32 1.03 2.66
divide_plusq_64 1.03 2.68
divide_dummy_32 0.63 1.63
divide_dummy_64 0.66 1.71
[a] At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.
[0] Of course, this notation doesn't actually work in C where / truncates the result, so the ceiling does nothing. So consider that pseudo-notation rather than straight C.
[1] I'm also interested in solutions where all types are uint32_t rather than uint64_t.
[2] In general, any p and q where p + q >= 2^64 causes an issue, due to overflow.
[3] That said, the function should be in a loop, because the performance of a microscopic function that takes half a dozen cycles only really matters if it is called in a fairly tight loop.
[4] This is a bit of a simplification - only the dividend p is dependent on the output of the previous iteration, so some work related to the processing of q can still be overlapped.
[5] Use such estimates with caution, however - overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f takes 5, it's likely not accurate to say the cost of the real work in f is 5 - 4 == 1, because of the way execution is overlapped.
Peter Cordes  16
This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).
I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.
It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.
If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.
Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.
Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).
It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress (on Godbolt).
I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).
Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.
ceil_div_pjc_branch:
xor eax,eax ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
sub rdi, 1
jc .d_was_zero ; fuses with the sub on SnB-family
tzcnt rax, rsi ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
shrx rax, rdi, rax
inc rax
.d_was_zero:
ret
I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: foo -= bar; if(foo>bar) carry_detected = 1;)
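For reference, here is a C sketch of that idiom applied to this problem (my own, not gcc output; whether the compiler actually turns the wraparound check into a branch on CF is exactly the part that's hard to guarantee):

uint64_t ceil_div_branch_c(uint64_t p, uint64_t q) {
    uint64_t pm1 = p - 1;          /* wraps to UINT64_MAX when p == 0 */
    if (pm1 > p)                   /* borrow detected: p was zero */
        return 0;
    return (pm1 >> lg(q)) + 1;     /* lg() as defined in the question */
}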
Branchless with ADC/SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.
ceil_div_pjc_asm_adc:
tzcnt rsi, rsi
sub rdi, 1
adc rdi, 0 ; d? d-1 : d. Sets CF=CF
shrx rax, rdi, rsi
sbb rax, -1 ; result++ if d was non-zero
ret
If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.
gcc5.4 doesn't seem likely to get this from C. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and it stores the carry into an integer register between every operation. gcc7, clang3.9, and icc17 all make horrible code from this.
#include <x86intrin.h>
// compiles to completely horrible code, putting the flags into integer regs between ops.
T ceil_div_adc(T d, T q) {
T e = lg(q);
unsigned long long dm1; // unsigned __int64
unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
CF = _addcarry_u64(CF, 0, dm1, &dm1);
T shifted = dm1 >> e;
_subborrow_u64(CF, shifted, -1, &dm1);
return dm1;
}
# gcc5.4 -O3 -march=haswell
mov rax, -1
tzcnt rsi, rsi
add rdi, rax
setc cl
xor edx, edx
add cl, -1
adc rdi, rdx
setc dl
shrx rdi, rdi, rsi
add dl, -1
sbb rax, rdi
ret
CMOV for the whole result: worse latency from q->result, since it's used sooner in the d->result dependency chain.
ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
sub rdi, 1
shrx rax, rdi, rsi
lea rax, [rax+1] ; inc preserving flags
cmovc rax, zeroed_register
ret
CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.
This doesn't achieve that goal, and needs an extra 7-byte mov rdx,-1 in the loop.
ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
mov rdx, -1
sub rdi, 1
shrx rax, rdi, rsi
cmovnc rdx, rax
sub rax, rdx ; res += d ? 1 : -res
ret
Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.
ceil_div_pjc_asm_setcc:
xor edx, edx ; needed every iteration
tzcnt rsi, rsi
sub rdi, 1
setc dl ; d!=0 ? 0 : 1
dec rdx ; d!=0 ? -1 : 0 // AND-mask
shrx rax, rdi, rsi
inc rax
and rax, rdx ; zero the bogus result if d was initially 0
ret
Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.
Of course, for pre-HSW, you also need to replace SHRX.
We can get gcc5.4 to emit something nearly as good: (still uses a separate TEST instead of setting mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.
T ceil_div_andmask(T p, T q) {
T mask = -(T)(p!=0); // TEST+SETCC+NEG
T e = lg(q);
T nonzero_result = ((p-1) >> e) + 1;
return nonzero_result & mask;
}
When the compiler knows that p is non-zero, it takes advantage and makes nice code:
// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif
T ceil_div_andmask_nonzerop(T p, T q) {
assume(p!=0);
return ceil_div_andmask(p, q);
}
# gcc5.4 -O3 -march=haswell
xor eax, eax # gcc7 does tzcnt in-place instead of wasting an insn on this
sub rdi, 1
tzcnt rax, rsi
shrx rax, rdi, rax
add rax, 1
ret
Run Code Online (Sandbox Code Playgroud)
only 3c latency from p->result, and constant q can CSE a lot.
T divide_A_chux(T p, T q) {
bool round_up = p & (q-1); // compiles differently from user23_variant with clang: AND instead of
return (p >> lg(q)) + round_up;
}
xor eax, eax # in-place tzcnt would save this
xor edx, edx # target for setcc
tzcnt rax, rsi
sub rsi, 1
test rsi, rdi
shrx rdi, rdi, rax
setne dl
lea rax, [rdx+rdi]
ret
Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.
Avoiding the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will add to that).
user2357112 supports Monica  11
uint64_t exponent = lg(q);
uint64_t mask = q - 1;
// v divide
return (p >> exponent) + (((p & mask) + mask) >> exponent);
// ^ round up
The separate calculation of the "round up" part avoids the overflow problems of (p + q - 1) >> lg(q). Depending on how smart the compiler is, it may be possible to express the "round up" part as ((p & mask) != 0) without a branch.
For unsigned integers in C, the efficient way to divide by a power of two is a right shift - shifting right by one divides by two (rounding down), so shifting right by n divides by 2^n (rounding down).
Now you want to round up rather than down, which you can do by first adding 2^n - 1, or equivalently subtracting one before the shift and adding one after (except for 0). This works out to something like:
unsigned ceil_div(unsigned d, unsigned e) {
/* compute ceil(d/2**e) */
return d ? ((d-1) >> e) + 1 : 0;
}
The conditional can be removed by using the boolean value of d for the addition and subtraction of one:
unsigned ceil_div(unsigned d, unsigned e) {
/* compute ceil(d/2**e) */
return ((d - !!d) >> e) + !!d;
}
Due to its size and speed requirements, the function should be static inline. It probably won't make a difference to the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:
static inline unsigned ceil_div(const unsigned d, const unsigned e){...
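For instance, a minimal header version (my sketch, combining the signature above with the branch-free body):

/* ceil_div.h */
static inline unsigned ceil_div(const unsigned d, const unsigned e) {
    /* compute ceil(d / 2**e) */
    return ((d - !!d) >> e) + !!d;
}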
Efficiently dividing an unsigned value by a power of two, rounding up
The ceiling or rounding part is easy when overflow is not a concern: simply add q-1, then shift.
Otherwise, since the possibility of rounding depends on all the bits of p smaller than q, detection of those bits is needed first, before they are shifted out.
uint64_t divide_A(uint64_t p, uint64_t q) {
bool round_up = p & (q-1);
return (p >> lg64(q)) + round_up;
}
This assumes the code has an efficient lg64(uint64_t x) function, which returns the base-2 log of x, if x is a power of 2.
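One possible lg64() (my assumption, mirroring the lg() helper from the question; only valid when x is a nonzero power of two):

static inline uint64_t lg64(uint64_t x) {
    return 63U - (uint64_t)__builtin_clzll(x);   /* a single BSR/LZCNT with gcc */
}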
My old answer didn't work if p was a power of two (whoops). So my new solution uses the __builtin_ctzll() and __builtin_ffsll() functions[0] available in gcc (which, as a bonus, provide the fast logarithm you mentioned!):
uint64_t divide(uint64_t p,uint64_t q) {
int lp=__builtin_ffsll(p);
int lq=__builtin_ctzll(q);
return (p>>lq)+(lp<(lq+1)&&lp);
}
Note that this assumes a long long is 64 bits. It would have to be tweaked slightly otherwise.
The idea here is that we need to round up if and only if p has fewer trailing zeros than q. Note that for a power of two, the number of trailing zeros is equal to the logarithm, so we can also use this builtin for the log.
The && lp part is just for the case where p is zero: otherwise it would output 1 here.
[0] We can't use __builtin_ctzll() for both, because it is undefined if p == 0.
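A quick self-contained sanity check of this approach (my own illustration):

#include <assert.h>
#include <stdint.h>

uint64_t divide(uint64_t p, uint64_t q) {       /* same function as above */
    int lp = __builtin_ffsll(p);
    int lq = __builtin_ctzll(q);
    return (p >> lq) + (lp < (lq + 1) && lp);
}

int main(void) {
    assert(divide(12, 8) == 2);   /* 12 has 2 trailing zeros, 8 has 3: round up */
    assert(divide(16, 8) == 2);   /* exact multiple: no rounding */
    assert(divide(0, 8)  == 0);   /* zero dividend handled by the "&& lp" term */
    return 0;
}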
If you can guarantee that the dividend/divisor will not exceed 63 (or 31) bits, you can use the following version, mentioned in the question. Note how p + q would overflow if they used all 64 bits. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.
uint64_t divide(uint64_t p, uint64_t q) {
return (p + q - 1) >> lg(q);
}
If those constraints can't be guaranteed, you can just do the floor method and then add 1 if it would round up. This can be determined by checking whether any bits of the dividend are within the range of the divisor. Note: p & -p extracts the lowest set bit (on 2's complement machines, or with the BLSI instruction).
uint64_t divide(uint64_t p, uint64_t q) {
return (p >> lg(q)) + ( (p & -p ) < q );
}
Which clang compiles to:
bsrq %rax, %rsi
shrxq %rax, %rdi, %rax
blsiq %rdi, %rcx
cmpq %rsi, %rcx
adcq $0, %rax
retq
This is a bit wordy and uses some newer instructions, so maybe there is a way to use the carry flag in the original version. Let's see: the RCR instruction does, but it seems like it would be worse... perhaps the SHRD instruction... It would be something like this (unable to test at the moment):
xor edx, edx ;edx = 0 (will store the carry flag)
bsr rcx, rsi ;rcx = lg(q) ... could be moved anywhere before shrd
lea rax, [rsi-1] ;rax = q-1 (adding p could carry)
add rax, rdi ;rax += p (handle carry)
setc dl ;rdx = carry flag ... or xor rdx and setc
shrd rax, rdx, cl ;rax = rdx:rax >> cl
ret
One more instruction, but it should be compatible with older processors (if it works... I always get the source/destination swapped - feel free to edit).

Addendum:

I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}
The fast log function isn't fully optimized down to bsr on clang and ICC, but you can do this:
#if defined(__x86_64__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint64_t lg(uint64_t x){
uint64_t ret;
//other compilers may want bsrq here
__asm__("bsr %0, %1":"=r"(ret):"r"(x));
return ret;
}
#endif
#if defined(__i386__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint32_t lg32(uint32_t x){
uint32_t ret;
__asm__("bsr %0, %1":"=r"(ret):"r"(x));
return ret;
}
#endif
A lot of human brainpower has already been applied to this problem, with several variants of great answers in C along with Peter Cordes's answer, which covers the best you could hope for in asm, with notes on trying to map it back to C.
So while the humans were having their crack at it, I wanted to see what some brute-force computing power had to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.
Usually, a superoptimizer is something like a brute-force search over all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions that quickly spirals out of control for more than a few instructions[1]. STOKE, on the other hand, takes a guided randomized approach: it randomly mutates an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into account. That's the one-liner anyway - there are plenty of papers if that piqued your curiosity.
So within a few minutes STOKE found some pretty interesting solutions. It found almost all of the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:
neg rsi
dec rdi
pext rax, rsi, rdi
inc eax
ret
It doesn't use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift using that mask. Outside of that trick, it's using the same trick that user QuestionC used: decrement p, shift, increment - but it happens to work even for a zero dividend, because it uses the 64-bit registers to distinguish the zero case from a large, MSB-set p.
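A rough C rendering of that neg/dec/pext idea (my own illustration, not STOKE's output verbatim):

#include <stdint.h>
#include <immintrin.h>   /* _pext_u64 requires BMI2, e.g. -march=haswell */

/* -q has 1s from bit lg(q) upward, so pext(p - 1, -q) performs the
 * (p - 1) >> lg(q) shift; doing it on 64-bit values means the p == 0
 * wraparound can't collide with large 32-bit dividends, and the final
 * 32-bit truncation brings that case back to 0. */
static inline uint32_t stoke32_pext(uint32_t p32, uint32_t q32) {
    uint64_t p = p32, q = q32;
    return (uint32_t)(_pext_u64(p - 1, (uint64_t)0 - q) + 1);
}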
I added the C version of this algorithm to the benchmark, and added it to the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so the vectors can process only half as many elements at once.
Even better, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases just "work" if you do part of the calculation with 64-bit ops. For example:
tzcntl ebp, esi
dec esi
add rdi, rsi
sarx rax, rdi, rbp
ret
That's roughly equivalent to the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn't work in general, since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important part in 64 bits. Convert that back into C with some casts and it actually figures out that using lea will do the addition in one instruction[2]:
stoke_32(unsigned int, unsigned int):
tzcnt edx, esi
mov edi, edi ; goes away when inlining
mov esi, esi ; goes away when inlining
lea rax, [rsi-1+rdi]
shrx rax, rax, rdx
ret
So it's a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi. The stand-alone function definition needs the mov instructions to zero-extend, because that's how the SysV x64 ABI works.
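The C-with-casts version referred to above would look roughly like this (my sketch; the intermediate sum fits in 33 bits, so the uint64_t addition can't overflow):

static inline uint32_t stoke_32_widen(uint32_t p, uint32_t q) {
    return (uint32_t)(((uint64_t)p + q - 1) >> lg32(q));   /* lg32() from the question */
}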
For the 64-bit functions, it didn't come up with anything that blows away the existing solutions, but it did come up with some neat stuff, like:
tzcnt r13, rsi
tzcnt rcx, rdi
shrx rax, rdi, r13
cmp r13b, cl
adc rax, 0
ret
That one calculates the trailing zeros of both arguments, and then adds 1 to the result if p has fewer trailing zeros than q, since that's when you need to round up. Clever!
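In C, that corresponds to something like the following (my sketch; the asm relies on tzcnt returning 64 for a zero input, which __builtin_ctzll doesn't guarantee, hence the explicit p != 0 guard):

uint64_t divide_tzcmp(uint64_t p, uint64_t q) {
    return (p >> __builtin_ctzll(q))
         + (p != 0 && __builtin_ctzll(p) < __builtin_ctzll(q));
}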
In general, it quickly got the idea that you need to shift by the tzcnt (just like most humans did), and then came up with a ton of other solutions for adjusting the result to account for the rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:
tzcnt r13, rsi
shrx rax, rdi, r13
imul rsi, rax
cmp rsi, rdi
adc rax, 0
ret
It's basically a "multiply and verify" approach - take the truncated res = p / q, multiply it back, and if it's different from p, add one: return res * q == p ? res : res + 1. Cool. This isn't really better than Peter's solutions, though. STOKE seems to have some flaws in its latency calculation - it thinks the above has a latency of 5 - but it's more like 8 or 9, depending on the architecture. So it sometimes narrows in on solutions that are based on its flawed latency calculation.
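A C rendering of that multiply-and-verify idea (my sketch, reusing the lg() helper from the question):

uint64_t ceil_div_mulcheck(uint64_t p, uint64_t q) {
    uint64_t res = p >> lg(q);              /* truncated p / q */
    return (res * q == p) ? res : res + 1;  /* bump if anything was truncated */
}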
[1] Interestingly enough, though, this brute-force approach reaches feasibility at around 5-6 instructions: if you assume you can trim the instruction count to around 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300^5 5-instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe under a week for 6-instruction ones. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window...
[2] This would be a slow 3-component lea, so the sequence STOKE found is still optimal for latency, which is what I optimized for.