Efficiently dividing unsigned value by a power of two, rounding up

Bee*_*ope 25 c optimization x86 bit-manipulation division

I want to implement unsigned integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)0. In C, a strawman implementation that doesn't take advantage of the restricted domain of q could look like the following function1:

/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
  uint64_t res = p / q;
  return p % q == 0 ? res : res + 1;
} 

... of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operation(s) - taking advantage of the fact that q is a power of 2.

You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power-of-two.

Undefined behavior is fine if q is zero.

Please note that the simple solution: (p+q-1) >> lg(q) doesn't work in general - try it with p == 2^64-100 and q == 256, for example2.
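To see why, here's a quick standalone check (my own harness, not part of the benchmark) comparing the naive formula against the slow reference for exactly that input:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t p = UINT64_MAX - 99;             // 2^64 - 100
    uint64_t q = 256;                         // lg(q) == 8
    uint64_t simple  = (p + q - 1) >> 8;      // p + 255 wraps around: result is 0
    uint64_t correct = p / q + (p % q != 0);  // slow reference implementation
    printf("simple=%llu correct=%llu\n",
           (unsigned long long)simple, (unsigned long long)correct);
    return 0;
}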

Platform Details

I'm interested in solutions in C, that are likely to perform well across a variety of platforms, but for the sake of concreteness, awarding the bounty and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:

  • Skylake CPU
  • gcc 5.4.0 with compile flags -O3 -march=haswell

Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.

Benchmark

I wrote a benchmark for the existing answers, and will share and update the results as changes are made.

Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop3.

You can avoid the whole inlining issue by ensuring your code is not inlined: declare it in another compilation unit. I tried that with a bench binary, but really the results were fairly pointless. Nearly all the implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0 took the same time, so you are mostly just measuring call + ret overhead. Furthermore, you almost never really use functions like this - unless you messed up, they will be available for inlining and that changes everything.

So the two benchmarks I'll focus on repeatedly call the method under test in a loop, allowing inlining, cross-function optimization, loop hoisting and even vectorization.

There are two overall benchmark flavors: latency and throughput. The key difference is that in the latency benchmark each call to divide depends on the previous call, so in general the calls cannot easily be overlapped4:

uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += divide_algo(total, q);                       
      q = rotl1(q);                         
    }
    return total;
  }

Note that the running total depends on the output of each divide call, and that it is also an input to the divide call.

The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows the work from one call to overlap with subsequent calls (both by the compiler, but especially by the CPU), and even allows vectorization:

uint32_t bench_divide_throughput(uint32_t p, uint32_t q) { 
    uint32_t total = p;                         
    for (unsigned i=0; i < ITERS; i++) {                
      total += fname(i, q);                     
      q = rotl1(q);                     
    }                                   
    return total;                           
  }

Note that here we feed in the loop counter as the dividend - it is variable, but it doesn't depend on the previous divide call.
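The rotl1 helper used above isn't shown in the post; a plausible definition (my sketch, not necessarily the author's) that varies q while keeping it a power of two:

#include <stdint.h>

// Hypothetical helper: rotate left by one bit, which changes the divisor
// every iteration while keeping it a power of two.
static inline uint32_t rotl1(uint32_t x) {
    return (x << 1) | (x >> 31);
}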

Furthermore, each benchmark has three flavors of behavior for the divisor, q:

  1. Compile-time constant divisor. For example, a call to divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time.
  2. Invariant divisor. Here the divisor is not known at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that the compile-time constant case does.
  3. Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.

Combining everything you get a total of 6 distinct benchmarks.

Results

Overall

For the purposes of picking an overall best algorithm, I looked at each of 12 subsets for the proposed algorithms: (latency, throughput) x (constant q, invariant q, variable q) x (32-bit, 64-bit) and assigned a score of 2, 1, or 0 per subtest as follows:

  • The best algorithm(s) (within 5% tolerance) receive a score of 2.
  • The "close enough" algorithms (no more than 50% slower than the best) receive a score of 1.
  • The remaining algorithms score zero.

Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:

Algorithm                  Score
divide_user23_variant         20
divide_chux                   20
divide_user23                 15
divide_peter                  14
divide_chrisdodd              12
stoke32                       11
divide_chris                   0
divide_weather                 0

So for the purposes of this specific test code, with this specific compiler and on this platform, user2357112's "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact identical code).

Here are all the sub-scores which sum to the above:

32-bit Algorithm           Total   LC   LI   LV   TC   TI   TV
divide_peter                   6    1    1    1    1    1    1
stoke32                        6    1    1    2    0    0    2
divide_chux                   10    2    2    2    1    2    1
divide_user23                  8    1    1    2    2    1    1
divide_user23_variant         10    2    2    2    1    2    1
divide_chrisdodd               6    1    1    2    0    0    2
divide_chris                   0    0    0    0    0    0    0
divide_weather                 0    0    0    0    0    0    0

64-bit Algorithm           Total   LC   LI   LV   TC   TI   TV
divide_peter_64                8    1    1    1    2    2    1
div_stoke_64                   5    1    1    2    0    0    1
divide_chux_64                10    2    2    2    1    2    1
divide_user23_64               7    1    1    2    1    1    1
divide_user23_variant_64      10    2    2    2    1    2    1
divide_chrisdodd_64            6    1    1    2    0    0    2
divide_chris_64                0    0    0    0    0    0    0
divide_weather_64              0    0    0    0    0    0    0

Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".

Analysis

At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The difference is stark: all of the fast algorithms were the fastest on at least two subtests, and in general when they didn't finish first they fell into the "close enough" category (the only exceptions being failed vectorizations in the case of stoke and chrisdodd). The slow algorithms, however, scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.

Auto-vectorization

Among the fast algorithms, a large differentiator was the ability to auto-vectorize.

None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.

For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q cases. In both of these tests the divisor q is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predictable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).
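To make that concrete, here's a rough sketch (my illustration, not gcc's actual output) of the shape such a vectorized throughput loop takes, including the post-loop reduction discussed below:

#include <stdint.h>
#include <immintrin.h>

// Sketch of the vectorized loop structure: 8 dividends per iteration,
// advanced by adding 8, with a horizontal reduction of the partial sums
// after the loop. The divide body itself is omitted; an add stands in.
uint32_t bench_vectorized_shape(unsigned iters) {
    __m256i iv   = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7); // dividends i..i+7
    __m256i step = _mm256_set1_epi32(8);
    __m256i acc  = _mm256_setzero_si256();
    for (unsigned i = 0; i < iters; i += 8) {
        acc = _mm256_add_epi32(acc, iv);      // stand-in for total += divide(i, q)
        iv  = _mm256_add_epi32(iv, step);     // next 8 dividends: [i+8 .. i+15]
    }
    uint32_t lanes[8], total = 0;
    _mm256_storeu_si256((__m256i *)lanes, acc);   // post-loop reduction of partial sums
    for (int j = 0; j < 8; j++) total += lanes[j];
    return total;
}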

In this scenario, gcc was able to vectorize peter, stoke, chux, user23 and user23_variant. It wasn't able to vectorize chrisdodd for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.

Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether a function vectorized or not:

Size Vec? Name
 045    N bench_c_div_stoke_64
 049    N bench_c_divide_chrisdodd_64
 059    N bench_c_stoke32_64
 212    Y bench_c_divide_chux_64
 227    Y bench_c_divide_peter_64
 220    Y bench_c_divide_user23_64
 212    Y bench_c_divide_user23_variant_64

The trend is clear - vectorized functions take about 4x the size of the non-vectorized ones. This is both because the core loops themselves are larger (vector instructions tend to be larger and there are more of them), and because the loop setup and especially the post-loop code is much larger: for example, the vectorized version requires a reduction to sum all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated - but if it were variable, the generated code would be even larger.

Furthermore, despite the large improvement in runtime, gcc's vectorization is actually poor. Here's an excerpt from the vectorized version of Peter's routine:

  on entry: ymm4 == all zeros
  on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
  4007a4:       c5 ed 76 c4             vpcmpeqd ymm0,ymm2,ymm4
  4007ad:       c5 fd df c5             vpandn   ymm0,ymm0,ymm5
  4007b1:       c5 dd fa c0             vpsubd   ymm0,ymm4,ymm0
  4007b5:       c5 f5 db c0             vpand    ymm0,ymm1,ymm0

This chunk works independently on 8 DWORD elements originating in ymm2. If we take x to be a single DWORD element of ymm2, and y1 the incoming value of ymm1, these four instructions correspond to:

                    x == 0   x != 0
x  = x ? 0 : -1; //     -1        0
x  = x & 1;      //      1        0
x  = 0 - x;      //     -1        0
x  = y1 & x;     //     y1        0

So the first three instructions could simply be replaced by the first one, since the states are identical in either case. So that's two cycles added to that dependency chain (which isn't loop carried) and two extra uops. Evidently gcc's optimization phases somehow interact poorly with the vectorization code here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.

Branches vs Branch-free

Nearly all of the solutions compiled to branch-free code, even if the C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use a conditional move or some variant. One exception is chrisdodd, which compiled with a branch (checking if p == 0) in all the throughput tests, but none of the latency ones. Here's a typical example from the constant q throughput test:

0000000000400e60 <bench_c_divide_chrisdodd_32>:
  400e60:       89 f8                   mov    eax,edi
  400e62:       ba 01 00 00 00          mov    edx,0x1
  400e67:       eb 0a                   jmp    400e73 <bench_c_divide_chrisdodd_32+0x13>
  400e69:       0f 1f 80 00 00 00 00    nop    DWORD PTR [rax+0x0]
  400e70:       83 c2 01                add    edx,0x1
  400e73:       83 fa 01                cmp    edx,0x1
  400e76:       74 f8                   je     400e70 <bench_c_divide_chrisdodd_32+0x10>
  400e78:       8d 4a fe                lea    ecx,[rdx-0x2]
  400e7b:       c1 e9 03                shr    ecx,0x3
  400e7e:       8d 44 08 01             lea    eax,[rax+rcx*1+0x1]
  400e82:       81 fa 00 ca 9a 3b       cmp    edx,0x3b9aca00
  400e88:       75 e6                   jne    400e70 <bench_c_divide_chrisdodd_32+0x10>
  400e8a:       c3                      ret    
  400e8b:       0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

The branch at 400e76 skips the case that p == 0. In fact, the compiler could have just peeled the first iteration out (calculating its result explicitly) and then avoided the jump entirely since after that it can prove that p != 0. In these tests, the branch is perfectly predictable, which could give an advantage to code that actually compiles using a branch (since the compare & branch code is essentially out of line and close to free), and is a big part of why chrisdodd wins the throughput, variable q case.

Detailed Test Results

Here you can find some detailed test results and some details on the tests themselves.

Latency

The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like 4.01 is the same as 4.00, but the larger deviations like 5.11 seem to be real and reproducible.

The results for divide_plusq_32 use (p + q - 1) >> lg(q), but are only shown for reference, since this function fails for large p + q. The results for dummy are for a very simple function that just does return p + q; it lets you estimate the benchmark overhead5 (the addition itself should take a cycle at most).

==============================
Bench: Compile-time constant Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.19      5.67
           divide_peter_64            2.18      5.64
                stoke32_32            1.93      5.00
                stoke32_64            1.97      5.09
              stoke_mul_32            2.75      7.13
              stoke_mul_64            2.34      6.06
              div_stoke_32            1.94      5.03
              div_stoke_64            1.94      5.03
            divide_chux_32            1.55      4.01
            divide_chux_64            1.55      4.01
          divide_user23_32            1.97      5.11
          divide_user23_64            1.93      5.00
  divide_user23_variant_32            1.55      4.01
  divide_user23_variant_64            1.55      4.01
       divide_chrisdodd_32            1.95      5.04
       divide_chrisdodd_64            1.93      5.00
           divide_chris_32            4.63     11.99
           divide_chris_64            4.52     11.72
         divide_weather_32            2.72      7.04
         divide_weather_64            2.78      7.20
           divide_plusq_32            1.16      3.00
           divide_plusq_64            1.16      3.00
           divide_dummy_32            1.16      3.00
           divide_dummy_64            1.16      3.00


==============================
Bench: Invariant Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.19      5.67
           divide_peter_64            2.18      5.65
                stoke32_32            1.93      5.00
                stoke32_64            1.93      5.00
              stoke_mul_32            2.73      7.08
              stoke_mul_64            2.34      6.06
              div_stoke_32            1.93      5.00
              div_stoke_64            1.93      5.00
            divide_chux_32            1.55      4.02
            divide_chux_64            1.55      4.02
          divide_user23_32            1.95      5.05
          divide_user23_64            2.00      5.17
  divide_user23_variant_32            1.55      4.02
  divide_user23_variant_64            1.55      4.02
       divide_chrisdodd_32            1.95      5.04
       divide_chrisdodd_64            1.93      4.99
           divide_chris_32            4.60     11.91
           divide_chris_64            4.58     11.85
         divide_weather_32           12.54     32.49
         divide_weather_64           17.51     45.35
           divide_plusq_32            1.16      3.00
           divide_plusq_64            1.16      3.00
           divide_dummy_32            0.39      1.00
           divide_dummy_64            0.39      1.00


==============================
Bench: Variable Q
==============================
                  Function         ns/call    cycles
           divide_peter_32            2.31      5.98
           divide_peter_64            2.26      5.86
                stoke32_32            2.06      5.33
                stoke32_64            1.99      5.16
              stoke_mul_32            2.73      7.06
              stoke_mul_64            2.32      6.00
              div_stoke_32            2.00      5.19
              div_stoke_64            2.00      5.19
            divide_chux_32            2.04      5.28
            divide_chux_64            2.05      5.30
          divide_user23_32            2.05      5.30
          divide_user23_64            2.06      5.33
  divide_user23_variant_32            2.04      5.29
  divide_user23_variant_64            2.05      5.30
       divide_chrisdodd_32            2.04      5.30
       divide_chrisdodd_64            2.05      5.31
           divide_chris_32            4.65     12.04
           divide_chris_64            4.64     12.01
         divide_weather_32           12.46     32.28
         divide_weather_64           19.46     50.40
           divide_plusq_32            1.93      5.00
           divide_plusq_64            1.99      5.16
           divide_dummy_32            0.40      1.05
           divide_dummy_64            0.40      1.04

Throughput

Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so their performance is relatively very good: a fraction of a cycle per call in many cases. One result is that, unlike most of the latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger than I would have expected).

==============================
Bench: Compile-time constant Q
==============================
                  Function         ns/call    cycles
                stoke32_32            0.39      1.00
            divide_chux_32            0.15      0.39
            divide_chux_64            0.53      1.37
          divide_user23_32            0.14      0.36
          divide_user23_64            0.53      1.37
  divide_user23_variant_32            0.15      0.39
  divide_user23_variant_64            0.53      1.37
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.34     11.23
           divide_chris_64            4.34     11.24
         divide_weather_32            1.35      3.50
         divide_weather_64            1.35      3.50
           divide_plusq_32            0.10      0.26
           divide_plusq_64            0.39      1.00
           divide_dummy_32            0.08      0.20
           divide_dummy_64            0.39      1.00


==============================
Bench: Invariant Q
==============================
                  Function         ns/call    cycles
                stoke32_32            0.48      1.25
            divide_chux_32            0.15      0.39
            divide_chux_64            0.48      1.25
          divide_user23_32            0.17      0.43
          divide_user23_64            0.58      1.50
  divide_user23_variant_32            0.15      0.38
  divide_user23_variant_64            0.48      1.25
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.35     11.26
           divide_chris_64            4.36     11.28
         divide_weather_32            5.79     14.99
         divide_weather_64           17.00     44.02
           divide_plusq_32            0.12      0.31
           divide_plusq_64            0.48      1.25
           divide_dummy_32            0.09      0.23
           divide_dummy_64            0.09      0.23


==============================
Bench: Variable Q
==============================
                  Function         ns/call    cycles
                stoke32_32            1.16      3.00
            divide_chux_32            1.36      3.51
            divide_chux_64            1.35      3.50
          divide_user23_32            1.54      4.00
          divide_user23_64            1.54      4.00
  divide_user23_variant_32            1.36      3.51
  divide_user23_variant_64            1.55      4.01
       divide_chrisdodd_32            1.16      3.00
       divide_chrisdodd_64            1.16      3.00
           divide_chris_32            4.02     10.41
           divide_chris_64            3.84      9.95
         divide_weather_32            5.40     13.98
         divide_weather_64           19.04     49.30
           divide_plusq_32            1.03      2.66
           divide_plusq_64            1.03      2.68
           divide_dummy_32            0.63      1.63
           divide_dummy_64            0.66      1.71

a At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.

0 Of course, this notation doesn't actually work in C, where / truncates the result, so the ceiling does nothing. Consider it pseudo-notation rather than straight C.

1 I'm also interested in solutions where all types are uint32_t rather than uint64_t.

2 In general, any p and q where p + q >= 2^64 causes an issue, due to overflow.

3 That said, the function should be in a loop, because the performance of a microscopic function that takes half a dozen cycles only really matters if it is called in a fairly tight loop.

4 This is a bit of a simplification - only the dividend p is dependent on the output of the previous iteration, so some work related to processing of q can still be overlapped.

5 Use such estimates with caution however - overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f takes 5, it's likely not accurate to say the cost of the real work in f is 5 - 4 == 1, because of the way execution is overlapped.

Pet*_*des 16

This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).

I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.


It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.

If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.

Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).


It's hard to guide gcc5.4 into emitting anything optimal; this is my work-in-progress on Godbolt.

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).


Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.

ceil_div_pjc_branch:
 xor    eax,eax          ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
 sub    rdi, 1
 jc    .d_was_zero       ; fuses with the sub on SnB-family
 tzcnt  rax, rsi         ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
 shrx   rax, rdi, rax
 inc    rax
.d_was_zero:
 ret
  • Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
  • HSW/SKL latency with successful branch prediction:
    • (d==0): No data dependency on d or q, breaks the dep chain. (control dependency on d to detect mispredicts and retire the branch).
    • (d!=0): q->result: tzcnt+shrx+inc = 5c
    • (d!=0): d->result: sub+shrx+inc = 3c
  • Throughput: probably just bottlenecked on uop throughput

I've tried, but failed, to get gcc to branch on CF from the subtract; it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: foo -= bar; if(foo>bar) carry_detected = 1;)
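For reference, a minimal sketch of that carry-detection idiom (my example; no claim about what gcc actually emits for it):

#include <stdint.h>

// After an unsigned subtraction, the result exceeding the original minuend
// means the operation wrapped, i.e. a borrow (CF on x86) occurred.
static inline int sub_with_borrow(uint64_t a, uint64_t b, uint64_t *diff) {
    *diff = a - b;
    return *diff > a;    // equivalently: a < b
}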


Branchless with ADC/SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.

ceil_div_pjc_asm_adc:
 tzcnt  rsi, rsi
 sub    rdi, 1
 adc    rdi, 0          ; d? d-1 : d.  Sets CF=CF
 shrx   rax, rdi, rsi
 sbb    rax, -1         ; result++ if d was non-zero
 ret    
  • Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
  • SKL latency:
    • q->result: tzcnt+shrx+sbb = 5c
    • d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c
  • Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.

If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.

gcc5.4 seems unlikely to get any of this from C. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and it stores the carry into an integer register between every operation. gcc7, clang3.9, and icc17 all produce horrible code from this.

#include <stdint.h>
#include <x86intrin.h>
typedef uint64_t T;   // stand-in for the operand type used throughout these snippets
// compiles to completely horrible code, putting the flags into integer regs between ops.
T ceil_div_adc(T d, T q) {
  T e = lg(q);
  unsigned long long dm1;  // unsigned __int64
  unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
  CF = _addcarry_u64(CF, 0, dm1, &dm1);
  T shifted = dm1 >> e;
  _subborrow_u64(CF, shifted, -1, &dm1);
  return dm1;
}
    # gcc5.4 -O3 -march=haswell
    mov     rax, -1
    tzcnt   rsi, rsi
    add     rdi, rax
    setc    cl
    xor     edx, edx
    add     cl, -1
    adc     rdi, rdx
    setc    dl
    shrx    rdi, rdi, rsi
    add     dl, -1
    sbb     rax, rdi
    ret

CMOV to fix up the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.

ceil_div_pjc_asm_cmov:
 tzcnt  rsi, rsi
 sub    rdi, 1
 shrx   rax, rdi, rsi
 lea    rax, [rax+1]     ; inc preserving flags
 cmovc  rax, zeroed_register
 ret    
  • Fused-domain uops: 5 (not counting ret) on SKL. 6 on HSW
  • SKL latency:
    • q->result: tzcnt+shrx+lea+cmov = 6c (1c worse than ADC/SBB)
    • d->result: sub+shrx(dep on q begins here)+lea+cmov = 4c
  • Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so one iteration per 1.25c. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem because at 1 iter per 1.25c, no port is saturated with uops that can only run on that port.

CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.

This doesn't achieve that goal, and needs an extra 7-byte mov rdx,-1 in the loop.

ceil_div_pjc_asm_cmov:
 tzcnt  rsi, rsi
 mov    rdx, -1
 sub    rdi, 1
 shrx   rax, rdi, rsi
 cmovnc rdx, rax
 sub    rax, rdx       ; res += d ? 1 : -res
 ret    

Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.

ceil_div_pjc_asm_setcc:
 xor    edx, edx        ; needed every iteration

 tzcnt  rsi, rsi
 sub    rdi, 1

 setc   dl              ; d!=0 ?  0 : 1
 dec    rdx             ; d!=0 ? -1 : 0   // AND-mask

 shrx   rax, rdi, rsi
 inc    rax
 and    rax, rdx        ; zero the bogus result if d was initially 0
 ret    

Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.

Of course, for pre-HSW, you also need to replace SHRX.

We can get gcc5.4 to emit something nearly as good: (still uses a separate TEST instead of setting mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.

T ceil_div_andmask(T p, T q) {
    T mask = -(T)(p!=0);  // TEST+SETCC+NEG
    T e = lg(q);
    T nonzero_result = ((p-1) >> e) + 1;
    return nonzero_result & mask;
}

When the compiler knows that p is non-zero, it takes advantage and makes nice code:

// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif

T ceil_div_andmask_nonzerop(T p, T q) {
  assume(p!=0);
  return ceil_div_andmask(p, q);
}
    # gcc5.4 -O3 -march=haswell
    xor     eax, eax             # gcc7 does tzcnt in-place instead of wasting an insn on this
    sub     rdi, 1
    tzcnt   rax, rsi
    shrx    rax, rdi, rax
    add     rax, 1
    ret

Chux/user23_variant

Only 3c latency from p->result, and a constant q can CSE a lot.

T divide_A_chux(T p, T q) {
  bool round_up = p & (q-1);  // compiles differently from user23_variant with clang: AND instead of 
  return (p >> lg(q)) + round_up;
}

    xor     eax, eax           # in-place tzcnt would save this
    xor     edx, edx           # target for setcc
    tzcnt   rax, rsi
    sub     rsi, 1
    test    rsi, rdi
    shrx    rdi, rdi, rax
    setne   dl
    lea     rax, [rdx+rdi]
    ret

Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.

  • Fused-domain uops: 8 (not counting ret) on HSW/SKL
  • HSW/SKL latency:
    • q->result: (tzcnt+shrx(p) | sub+test(p)+setne) + lea(or add) = 5c
    • d->result: test(dep on q begins here)+setne+lea = 3c. (the shrx->lea chain is shorter, and thus not the critical path)
  • Throughput: Probably just bottlenecked on the frontend, at one iter per 2c. Saving the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck as well).


use*_*ica 11

uint64_t exponent = lg(q);
uint64_t mask = q - 1;
//     v divide
return (p >> exponent) + (((p & mask) + mask) >> exponent);
//                       ^ round up

"向上"部分的单独计算避免了溢出问题(p+q-1) >> lg(q).根据编译器的智能程度,可以将"向上"部分表示为((p & mask) != 0)不分支.


Chr*_*odd 8

For unsigned integers in C, the efficient way to divide by a power of two is a right shift - shifting right by one divides by two (rounding down), so shifting right by n divides by 2^n (rounding down).

Now that you want to round up rather than down, you can do that by first adding 2^n - 1, or equivalently by subtracting one before the shift and adding one after (except for 0). That works out to something like:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return d ? ((d-1) >> e) + 1 : 0;
}

The conditional can be removed by using the boolean value of d for the addition and subtraction of one:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return ((d - !!d) >> e) + !!d;
}

Due to its size and speed requirements, the function should be made static inline. It probably won't make a difference to the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:

static inline unsigned ceil_div(const unsigned d, const unsigned e){...


chu*_*ica 8

Efficiently dividing unsigned value by a power of two, rounding up

[Re-write] given OP's clarification concerning powers of 2.

When overflow is not a concern, the round-up or ceiling part is easy: simply add q-1, then shift.

Otherwise, since the possibility of rounding up depends on all the bits of p less significant than q, those bits need to be detected first, before they are shifted out.

uint64_t divide_A(uint64_t p, uint64_t q) {
  bool round_up = p & (q-1);
  return (p >> lg64(q)) + round_up;
} 

This assumes the code has access to an efficient lg64(uint64_t x) function, which returns the base-2 log of x if x is a power of 2.
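For completeness, one way such an lg64 might look (my sketch, not chux's code), given that for a power of two the trailing-zero count equals the log:

#include <stdint.h>

// Compiles down to tzcnt/bsf with gcc; undefined for x == 0, like the
// question's lg().
static inline uint64_t lg64(uint64_t x) {
    return (uint64_t)__builtin_ctzll(x);
}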


Chr*_*ris 7

My old answer doesn't work if p is a power of two (whoops). So my new solution uses the __builtin_ctzll() and __builtin_ffsll() functions0 available in gcc (which, as a bonus, provide the fast logarithm you mentioned!):

uint64_t divide(uint64_t p,uint64_t q) {
    int lp=__builtin_ffsll(p);
    int lq=__builtin_ctzll(q);
    return (p>>lq)+(lp<(lq+1)&&lp);
}

Note that this assumes a long long is 64 bits. Otherwise it would have to be tweaked a bit.

The idea here is that we need to round up if, and only if, p has fewer trailing zeros than q. Note that for a power of two the number of trailing zeros is equal to the logarithm, so we can use this builtin for the log as well.

&&lp部分仅适用于p零的情况:否则它将1在此处输出.

0不能同时__builtin_ctzll()用于两者,因为它是未定义的p==0.

  • This fails for `p = 1, q = 2`. The correct answer is 1, but your code gives zero: `((1 >> 1)+(2 >> 1)-1) >> (lg(2)-1) == (0 + 1 - 1) >> (1 - 1) == 0 >> 0 == 0`. (2 upvotes)
  • Your new function works; I've added it to the benchmarks above. However, it is considerably slower than the leading solutions. (2 upvotes)

tec*_*rus 6

If the dividend/divisor can be guaranteed not to exceed 63 (or 31) bits, you can use the following version, mentioned in the question. Note how p + q would overflow if they used all 64 bits. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.

uint64_t divide(uint64_t p, uint64_t q) {
  return (p + q - 1) >> lg(q);
}

If those constraints can't be guaranteed, you can just use the floor method and then add 1 if it should round up. That can be determined by checking whether any bits of the dividend fall within the range of the divisor. Note: p & -p extracts the lowest set bit (on 2s-complement machines, or with the BLSI instruction).

uint64_t divide(uint64_t p, uint64_t q) {
  return (p >> lg(q)) + ( (p & -p ) < q );
}

Which clang compiles to:

    bsrq    %rax, %rsi
    shrxq   %rax, %rdi, %rax
    blsiq   %rdi, %rcx
    cmpq    %rsi, %rcx
    adcq    $0, %rax
    retq

That's a bit wordy and uses some newer instructions, so maybe there's a way to use the carry flag in the original version. Let's see: the RCR instruction does it, but seems like it would be worse... perhaps the SHRD instruction... It would be something like this (unable to test at the moment):

    xor     edx, edx     ;edx = 0 (will store the carry flag)
    bsr     rcx, rsi     ;rcx = lg(q) ... could be moved anywhere before shrd
    lea     rax, [rsi-1] ;rax = q-1 (adding p could carry)
    add     rax, rdi     ;rax += p  (handle carry)
    setc    dl           ;rdx = carry flag ... or xor rdx and setc
    shrd    rax, rdx, cl ;rax = rdx:rax >> cl
    ret

1 more instruction, but it should be compatible with older processors (if it works... I always get the source/destination swapped - feel free to edit).
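The same "don't lose the carry" idea can be sketched in C with a 128-bit intermediate (my illustration, not part of this answer): compute p + q - 1 in 128 bits so the carry out of bit 63 isn't lost, then shift. This assumes gcc/clang's unsigned __int128 and a power-of-two q.

#include <stdint.h>

static inline unsigned lg_pow2_64(uint64_t q) { return (unsigned)__builtin_ctzll(q); }

uint64_t divide_wide(uint64_t p, uint64_t q) {
    unsigned __int128 sum = (unsigned __int128)p + q - 1;  // 65-bit sum, no wrap
    return (uint64_t)(sum >> lg_pow2_64(q));
}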

Addendum:

I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzl(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

The fast log functions don't fully optimize down to bsr on clang and ICC, but you can do this:

#if defined(__x86_64__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint64_t lg(uint64_t x){
  uint64_t ret;
  // AT&T operand order: source first, destination second
  // (other compilers may want bsrq here)
  __asm__("bsr  %1, %0":"=r"(ret):"r"(x));
  return ret;
}
#endif

#if defined(__i386__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint32_t lg32(uint32_t x){
  uint32_t ret;
  __asm__("bsr  %1, %0":"=r"(ret):"r"(x));
  return ret;
}
#endif


Bee*_*ope 5

A lot of human brainpower has already been applied to this problem, with several variants of great answers in C, along with Peter Cordes' answer which covers about the best you could hope for in asm, with notes on trying to map it back to C.

So while the humans were taking their crack at it, I thought it would be interesting to see what some brute-force compute power had to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.

Usually, a superoptimizer works something like a brute-force search over all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions, that quickly spirals out of control beyond more than a few instructions1. STOKE, on the other hand, takes a guided randomized approach: it randomly mutates an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into account. That's the one-liner anyway - there are plenty of papers if that piqued your curiosity.

So, within a few minutes STOKE found some pretty interesting solutions. It found almost all of the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:

neg rsi                         
dec rdi                         
pext rax, rsi, rdi
inc eax
ret

It doesn't use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg rsi to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift using that mask. Outside of that trick, it's using the same trick that user QuestionC used: decrement p, shift, increment - but it happens to work even for a zero dividend, because it uses the 64-bit registers to distinguish the zero case from a p with the MSB set.

I added a C version of this algorithm to the benchmark, and added it to the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so the vectors can only process half as many elements at a time.
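The post doesn't show that C version; here's my guess at how it might look, using the BMI2 _pext_u64 intrinsic (the function name and exact form are mine, not the benchmarked code):

#include <stdint.h>
#include <immintrin.h>   // _pext_u64 (requires BMI2)

// For q = 2^k, -(uint64_t)q is a mask with ones from bit k upward, so
// pext(p - 1, mask) == (p - 1) >> k, done in 64 bits. For p == 0, p - 1
// wraps to all-ones and the final 32-bit increment wraps the result to 0.
static inline uint32_t stoke32_pext(uint32_t p, uint32_t q) {
    uint64_t mask    = -(uint64_t)q;
    uint64_t shifted = _pext_u64((uint64_t)p - 1, mask);
    return (uint32_t)shifted + 1;
}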

Better still, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases happen to "just work" if you use 64-bit ops for part of the calculation. For example:

tzcntl ebp, esi      
dec esi             
add rdi, rsi        
sarx rax, rdi, rbp
ret

That's the equivalent of the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn't work in general, since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important parts in 64 bits. Convert that back into C with some casts and it actually figures out that it can use lea to do the addition in one instruction2:

stoke_32(unsigned int, unsigned int):
        tzcnt   edx, esi
        mov     edi, edi          ; goes away when inlining
        mov     esi, esi          ; goes away when inlining
        lea     rax, [rsi-1+rdi]
        shrx    rax, rax, rdx
        ret

So it's a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi. The stand-alone function definition needs the mov instructions to zero-extend, because that's how the SysV x64 ABI works.
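For reference, a sketch of the cast-based C (my reconstruction, reusing the lg32() helper from the question):

#include <stdint.h>

static inline uint32_t lg32(uint32_t x) { return 31U - (uint32_t)__builtin_clz(x); }

// With 32-bit p and q, the sum p + q - 1 can't overflow a 64-bit
// intermediate, so the simple formula from the question becomes safe.
uint32_t stoke_32(uint32_t p, uint32_t q) {
    return (uint32_t)(((uint64_t)p + q - 1) >> lg32(q));
}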


For the 64-bit functions, it didn't come up with anything that blows away the existing solutions, but it did come up with some neat stuff, like:

  tzcnt  r13, rsi      
  tzcnt  rcx, rdi      
  shrx   rax, rdi, r13 
  cmp    r13b, cl        
  adc    rax, 0        
  ret

That guy counts the trailing zeros of both arguments, then adds 1 to the result if p has fewer trailing zeros than q, since that's exactly when you need to round up. Clever!

In general, it got the insight that you need to shift by the tzcnt pretty quickly (just like most humans did), and then came up with a ton of other solutions to the problem of adjusting the result to account for the rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:

tzcnt r13, rsi                  
shrx  rax, rdi, r13             
imul  rsi, rax                   
cmp   rsi, rdi                    
adc   rax, 0                    
ret 

This is basically a "multiply and verify" approach - take the truncated result res = p / q, multiply it back, and if it differs from p, add one: return res * q == p ? res : res + 1. Cool. It isn't really better than Peter's solution, though. STOKE seems to have some flaws in its latency calculation - it thinks the above has a latency of 5 - but it's more like 8 or 9 depending on the architecture. So it sometimes narrows in on solutions based on its flawed latency calculation.
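In C, the multiply-and-verify idea corresponds roughly to the following (a sketch, not the benchmarked code); it assumes a power-of-two q:

#include <stdint.h>

uint64_t divide_mul_verify(uint64_t p, uint64_t q) {
    uint64_t res = p >> (uint64_t)__builtin_ctzll(q);   // truncated p / q
    return res * q == p ? res : res + 1;                // bump if there was a remainder
}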


1 Interestingly enough, this brute-force approach reaches feasibility at around 5 or 6 instructions: if you assume you can trim the instruction count to around 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300^5 5-instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe under a week for 6-instruction ones. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window...

2 That would be a somewhat slow lea, so the sequence STOKE found is still optimal for latency, which is what I optimized for.