C++ optimization mystery

Mat*_*nti 4 c++ optimization bit-shift multiplication

Take the following two snippets:

#include <iostream>
using namespace std;

unsigned long int utime(); //the asker's timing helper (not shown; presumably returns microseconds)

int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i);

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

//Same includes and utime() declaration as in the first snippet.
int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i) >> 2;

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

I'm benchmarking 128-bit integers in C++. When executing the first one (just the multiplication), everything runs in about 0.95 seconds. When I also add the bit-shift operation (second snippet), the execution time goes up to a staggering 2.49 seconds.

How is this possible? I thought bit shifting was one of the lightest operations for a processor. How can such a simple operation cause so much overhead? I'm compiling with the -O3 flag enabled.

Any ideas?

bog*_*dan 12

This question has been bugging me for the past few days, so I decided to do some more investigation. My initial answer focused on the difference in data values between the two tests. My assertion was that the integer multiplication unit in the processor finishes the operation in fewer clock cycles if one of the operands is zero.

While there are instructions that are clearly documented to work that way (integer division, for example), there are very strong indications that integer multiplication is done in a constant number of cycles in modern processors, regardless of input. The note in Intel's documentation that initially made me think that the number of cycles for integer multiplication can depend on input data doesn't seem to apply to these instructions. Also, I did some more rigorous performance tests with the same sequence of instructions on both zero and non-zero operands and the results didn't yield significant differences. As far as I can tell, harold's comment on this subject is correct. My mistake; sorry.

While contemplating the possibility of deleting this answer altogether, so that it doesn't lead people astray in the future, I realized there were still quite a few more things worth saying on this subject. I also think there's at least one other way in which data values can influence performance in such calculations (included in the last section). So, I decided to restructure and enhance the rest of the information in my initial answer, started writing, and... didn't quite stop for a while. It's up to you to decide whether it was worth it.

The information is structured into the following sections:

  • What the code does
  • What the compiler does
  • What the processor does
  • What you can do about it
  • Unanswered questions

What the code does

It overflows, mostly.

In the first version, n starts overflowing on the 33rd iteration. In the second version, with the shift, n starts overflowing on the 52nd iteration.

In the version without the shift, starting with the 128th iteration, n is zero (it overflows "cleanly", leaving only zeros in the least significant 128 bits of the result).

In the second version, the right shift (dividing by 4) takes out more factors of two from the value of n on each iteration than the new operands bring in, so the shift results in rounding on some iterations. Quick calculation: the total number of factors of two in all numbers from 1 to 128 is equal to

128/2 + 128/4 + ... + 2 + 1 = 2^6 + 2^5 + ... + 2 + 1 = 2^7 - 1 = 127

while the number of factors of two taken out by the right shift (if it had enough to take from) is 128*2, more than double.
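If you want to double-check that count, a quick brute-force sketch could look like this:

#include <iostream>

int main()
{
   //Count the total number of factors of two in 1 * 2 * ... * 128.
   int count = 0;
   for (int k = 1; k <= 128; ++k)
      for (int m = k; m % 2 == 0; m /= 2)
         ++count;
   std::cout << count << std::endl; //prints 127, i.e. 2^7 - 1
}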

Armed with this knowledge, we can give a first answer: from the point of view of the C++ standard, this code spends most of its time in undefined behaviour land, so all bets are off. Problem solved; stop reading now.

What the compiler does

If you're still reading, from this point on we'll ignore the overflow and look at some implementation details. "The compiler", in this context, means GCC 4.9.2 or Clang 3.5.1. I only made performance measurements on code generated by GCC. For Clang, I looked at the generated code for a few test cases and noted some differences that I'll mention below, but I didn't actually run the code; I may have missed some things.

Both multiplication and shift operations are available for 64-bit operands, so the 128-bit operations need to be implemented in terms of those. First, multiplication: n can be written as 2^64 * nh + nl, where nh and nl are its high and low 64 bits, respectively. The same goes for i. So, the multiplication can be written as:

(2^64 * nh + nl) * (2^64 * ih + il) = 2^128 * nh * ih + 2^64 * (nh * il + nl * ih) + nl * il

The first term doesn't have any non-zero bits in the lower 128-bit part; it is either all overflow or all zeros. Since ignoring integer overflow is valid and common for C++ implementations, that's what the compiler does: the first term is ignored completely.

The parenthesized sum only contributes bits to the upper 64-bit half of the 128-bit result; any overflow resulting from the two multiplications or from the addition is also ignored (the result is truncated to 64 bits).

The last term determines the bits in the low 64-bit half of the result and, if the result of that multiplication has more than 64 bits, the extra bits need to be added to the high 64-bit half obtained from the parenthesis discussed before. There's a very useful multiplication instruction in x86-64 assembly that does just what's needed: takes two 64-bit operands and places the result in two 64-bit registers, so the high half is ready to be added to the result of the operations in the parenthesis.

That is how 128-bit integer multiplication is implemented: three 64-bit multiplications and two 64-bit additions.
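To make that concrete, a C++ sketch of the same lowering could look like the following (mul64x64_to_128 is a made-up helper name modelling the widening MUL instruction; this is an illustration, not the compiler's actual output):

#include <cstdint>

//Models the widening MUL instruction: 64 x 64 -> 128 bits.
static void mul64x64_to_128(std::uint64_t a, std::uint64_t b,
                            std::uint64_t &hi, std::uint64_t &lo)
{
   __uint128_t p = static_cast<__uint128_t>(a) * b;
   hi = static_cast<std::uint64_t>(p >> 64);
   lo = static_cast<std::uint64_t>(p);
}

//n = 2^64 * nh + nl, i = 2^64 * ih + il; the 2^128 term is discarded.
static void mul128(std::uint64_t nh, std::uint64_t nl,
                   std::uint64_t ih, std::uint64_t il,
                   std::uint64_t &rh, std::uint64_t &rl)
{
   std::uint64_t hi, lo;
   mul64x64_to_128(nl, il, hi, lo);   //the one widening multiplication
   rl = lo;                           //low 64 bits of the result
   rh = hi + nh * il + nl * ih;       //two more multiplications, two additions
}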

Now, the shift: using the same notations as above, the two least significant bits of nh need to become the two most significant bits of nl, after the contents of the latter are shifted right by two bits. Using C++ syntax, it would look like this:

nl = nh << 62 | nl >> 2; //Doesn't change nh, only uses its bits.

Besides that, nh also needs to be shifted, using something like

nh >>= 2;

That is how the compiler implements a 128-bit shift. For the first part, there's an x86-64 instruction that has the exact semantics of that expression; it's called SHRD. Using it can be good or bad, as we'll see below, and the two compilers make slightly different choices in this respect.
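In x86-64 terms, with nl in rax and nh in rdx, that pair of shifts could look like this (a sketch; harold's answer below shows the same two instructions in context):

shrd rax, rdx, 2   ; rax = low 64 bits of (rdx:rax) >> 2
sar  rdx, 2        ; arithmetic right shift of the high half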

What the processor does

... is highly processor-dependent. (No... really?!)

Detailed information about what happens for Haswell processors can be found in harold's excellent answer. Here, I'll try to cover more ground at a higher level. For more detailed data, the usual sources are:

  • Agner Fog's optimization manuals and instruction tables
  • Intel's and AMD's optimization reference manuals

I'll refer to the following architectures:

  • IntelSB: the Intel Sandy Bridge family
  • IntelH: the Intel Haswell family
  • AMD: contemporary AMD microarchitectures

I have measurement data taken on an IntelSB system; I think it's precise enough, as long as the compiler doesn't act up. Unfortunately, when working with such tight loops, this can happen very easily. At various points during testing, I had to use all kinds of stupid tricks to avoid GCC's idiosyncrasies, usually related to register use. For example, it seems to have a tendency to shuffle registers around unnecessarily, when compiling simpler code than for other cases when it generates optimal assembly. Ironically, on my test setup, it tended to generate optimal code for the second sample, using the shift, and worse code for the first one, making the impact of the shift less visible. Clang/LLVM seems to have fewer of those bad habits, but then again, I looked at fewer examples using it and I didn't measure any of them, so this doesn't mean much. In the interest of comparing apples with apples, all measurement data below refers to the best code generated for each case.

First, let's rearrange the expression for 128-bit multiplication from the previous section into a (horrible) diagram:

nh * il
        \
         +  -> tmp
        /          \
nl * ih             + -> next nh
                   /
             high 64 bits
                 /
nl * il --------
         \
      low 64 bits 
           \
             -> next nl

(sorry, I hope it gets the point across)

Some important points:

  • The two additions can't execute until their respective inputs are ready; the final addition can't execute until everything else is ready.
  • The three multiplications can, theoretically, execute in parallel (no input depends on another multiplication's output).
  • In the ideal scenario above, the total number of cycles to complete the entire calculation for one iteration is the sum of the number of cycles for one multiplication and two additions.
  • The next nl can be ready early. This, together with the fact that the next il and ih are very cheap to calculate, means the nl * il and nl * ih calculations for the next iteration can start early, possibly before the next nh has been computed.

Multiplications can't really execute entirely in parallel on these processors, as there's only one integer multiplication unit for each core, but they can execute concurrently through pipelining. One multiplication can begin executing on each cycle on Intel, and every 4 cycles on AMD, even if previous multiplications haven't finished executing yet.
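For instance, assuming a 3-cycle multiplication latency (the figure that shows up in the measurements below) and Intel's one-per-cycle throughput, three independent multiplications could overlap like this:

cycle:   1   2   3   4   5
mul #1:  X   X   X              -> result ready after cycle 3
mul #2:      X   X   X          -> result ready after cycle 4
mul #3:          X   X   X      -> result ready after cycle 5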

All of the above mean that, if the loop's body doesn't contain anything else that gets in the way, the processor can reorder those multiplications to achieve something as close as possible to the ideal scenario above. This applies to the first code snippet. On IntelH, as measured by harold, it's exactly the ideal scenario: 5 cycles per iteration are made up of 3 cycles for one multiplication and one cycle each for the two additions (impressive, to be honest). On IntelSB, I measured 6 cycles per iteration (closer to 5.5, actually).

The problem is that in the second code snippet something does get in the way:

nh * il
        \                              normal shift -> next nh
         +  -> tmp                   /
        /          \                /
nl * ih             + ----> temp nh
                   /                \
             high 64 bits            \
                 /                     "composite" shift -> next nl
nl * il --------                     /
         \                          /
      low 64 bits                  /
           \                      /
             -> temp nl ---------

The next nl is no longer ready early. temp nl has to wait for temp nh to be ready, so that both can be fed into the composite shift, and only then will we have the next nl. Even if both shifts are very fast and execute in parallel, they don't just add the execution cost of one shift to an iteration; they also change the dynamics of the loop's "pipeline", acting like a sort of synchronizing barrier.

If the two shifts finish at the same time, then all three multiplications for the next iteration will be ready to execute at the same time, and they can't all start in parallel, as explained above; they'll have to wait for one another, wasting cycles. This is the case on IntelSB, where the two shifts are equally fast (see below); I measured 8 cycles per iteration for this case.

If the two shifts don't finish at the same time, it will typically be the normal shift that finishes first (the composite shift is slower on most architectures). This means that the next nh will be ready early, so the top multiplication can start early for the next iteration. However, the other two multiplications still have to wait more (wasted) cycles for the composite shift to finish, and then they'll be ready at the same time and one will have to wait for the other to start, wasting some more time. This is the case on IntelH, measured by harold at 9 cycles per iteration.

I expect AMD to fall under this last category as well. While there's an even bigger difference in performance between the composite shift and normal shift on this platform, integer multiplications are also slower on AMD than on Intel (more than twice as slow), making the first sample slower to begin with. As a very rough estimate, I think the first version could take about 12 cycles on AMD, with the second one at around 16. It would be nice to have some concrete measurements, though.

Some more data on the troublesome composite shift, SHRD:

  • On IntelSB, it's exactly as cheap as a simple shift (great!); simple shifts are about as cheap as they come: they execute in one cycle, and two shifts can start executing each cycle.
  • On IntelH, SHRD takes 3 cycles to execute (yes, it got worse in the newer generation), and two shifts of any kind (simple or composite) can start executing each cycle.
  • On AMD, it's even worse. If I'm reading the data correctly, executing an SHRD keeps both shift execution units busy until execution finishes - no parallelism and no pipelining possible; it takes 3 cycles, during which no other shift can start executing.

What you can do about it

I can think of three possible improvements:

  1. replace SHRD with something faster on platforms where it makes sense;
  2. optimize the multiplication to take advantage of the data types involved here;
  3. restructure the loop.

1. SHRD can be replaced with two shifts and a bitwise OR, as described in the compiler section. A C++ implementation of a 128-bit shift right by two bits could look like this:

#include <cstdint>

__int128_t shr2(__int128_t n)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   uint64_t rl = nl >> 2 | nh << 62;
   int64_t rh = nh >> 2;

   //Pack the result.
   return static_cast<__int128_t>(rh) << 64 | rl;
}

Although it looks like a lot of code, only the middle section doing the actual work generates shifts and ORs. The other parts merely indicate to the compiler which 64-bit parts we want to work with; since the 64-bit parts are already in separate registers, those are effectively no-ops in the generated assembly code.

However, keep in mind that this amounts to "trying to write assembly using C++ syntax", and it's generally not a very bright idea. I'm only using it because I verified that it works for GCC and I'm trying to keep the amount of assembly code in this answer to a minimum. Even so, there's one surprise: the LLVM optimizer detects what we're trying to do with those two shifts and one OR and... replaces them with an SHRD in some cases (more about this below).

Functions of the same form can be used for shifts by other numbers of bits, less than 64. From 64 to 127, it gets easier, but the form changes. One thing to keep in mind is that it would be a mistake to pass the number of bits for shifting as a runtime parameter to a shr function. Shift instructions by a variable number of bits are slower than the ones using a constant number on most architectures. You could use a non-type template parameter to generate different functions at compile time - this is C++, after all...
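A sketch of that idea, generalizing the shr2 function above (only meant for shift counts from 1 to 63; the 64-to-127 form would be a separate specialization):

#include <cstdint>

template <unsigned Bits> //shift count fixed at compile time
__int128_t shr(__int128_t n)
{
   static_assert(Bits >= 1 && Bits < 64, "this form only handles 1..63");

   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   uint64_t rl = nl >> Bits | static_cast<uint64_t>(nh) << (64 - Bits);
   int64_t rh = nh >> Bits;

   //Pack the result.
   return static_cast<__int128_t>(rh) << 64 | rl;
}

Usage would be something like n = shr<2>(n); each distinct Bits value generates its own function with a constant shift count.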

I think using such a function makes sense on all architectures except IntelSB, where SHRD is already as fast as it can get. On AMD, it will definitely be an improvement. Less so on IntelH: for our case, I don't think it will make a difference, but generally it could shave one cycle off some calculations; there could theoretically be cases where it could make things slightly worse, but I think those are very uncommon (as usual, there's no substitute for measuring). I don't think it will make a difference for our loop because it will change things from [nh being ready after one cycle and nl after three] to [both being ready after two]; this means all three multiplications for the next iteration will be ready at the same time and they'll have to wait for one another, essentially wasting the cycle that was gained by the shift.

GCC seems to use SHRD on all architectures, and the "assembly in C++" code above can be used as an optimization where it makes sense. The LLVM optimizer uses a more nuanced approach: it does the optimization (replaces SHRD) automatically for AMD, but not for Intel, where it even reverses it, as mentioned above. This may change in future releases, as indicated by the discussion on the patch for LLVM that implemented this optimization. For now, if you want to use the alternative with LLVM on Intel, you'll have to resort to assembly code.

2. Optimizing the multiplication: The test code uses a 128-bit integer for i, but that's not needed in this case, as its value fits easily in 64 bits (32, actually, but that doesn't help us here). What this means is that ih will always be zero; this reduces the diagram for 128-bit multiplication to the following:

nh * il
        \
         \
          \
           + -> next nh
          /
    high 64 bits
        /
nl * il 
        \
     low 64 bits 
          \
            -> next nl

Normally, I'd just say "declare i as long long and let the compiler optimize things" but unfortunately this doesn't work here; both compilers go for the standard behaviour of converting the two operands to their common type before doing the calculation, so i ends up on 128 bits even if it starts on 64. We'll have to do things the hard way:

#include <cstdint>

__int128_t mul(__int128_t n, long long i)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   __asm__(R"(
    movq    %0, %%r10
    imulq   %2, %%r10
    mulq    %2
    addq    %%r10, %0
   )" : "+d"(nh), "+a"(nl) : "r"(i) : "%r10");

   //Pack the result.
   return static_cast<__int128_t>(nh) << 64 | nl;
}

I said I tried to avoid assembly code in this answer, but it's not always possible. I managed to coax GCC into generating the right code with "assembly in C++" for the function above, but once the function is inlined everything falls apart - the optimizer sees what's going on in the complete loop body and converts everything to 128 bits. LLVM seems to behave in this case, but, since I was testing on GCC, I had to use a reliable way to get the right code in there.

Declaring i as long long and using this function instead of the normal multiplication operator, I measured 5 cycles per iteration for the first sample and 7 cycles for the second one on IntelSB, a gain of one cycle in each case. I expect it to shave one cycle off the iterations for both examples on IntelH as well.

3. The loop can sometimes be restructured to encourage pipelined execution, when (at least some) iterations don't really depend on previous results, even though it may look like they do. For example, we could replace the for loop for the second sample with something like this:

__int128_t n2 = 1;
long long j = 1000000000 / 2;
for(long long i = 1; i < 1000000000 / 2; ++i, ++j)
{
   n *= i;
   n2 *= j;
   n >>= 2;
   n2 >>= 2; 
}
n *= (n2 * j) >> 2;

This takes advantage of the fact that some partial results can be calculated independently and only aggregated at the end. We're also hinting to the compiler that we want to pipeline the multiplications and shifts (not always necessary, but it does make a small difference for GCC for this code).

The code above is nothing more than a naive proof of concept. Real code would need to handle the total number of iterations in a more reliable way. The bigger problem is that this code won't generate the same results as the original, because of different behaviour in the presence of overflow and rounding. Even if we stop the loop on the 51st iteration, to avoid overflow, the result will still be different by about 10%, because of rounding happening in different ways when shifting right. In real code, this would most likely be a problem; but then again, you wouldn't have real code like this, would you?

Assuming this technique is applied to a case where the problems above don't occur, I measured the performance of such code in a few cases, again on IntelSB. The results are given in "cycles per iteration", as before, where "iteration" means the one from the original code (I divided the total number of cycles by the number of iterations executed by the original code, so the numbers remain comparable).


har*_*old 6

After benchmarking both on my 4770K (using the assembly generated by GCC 4.7.3 at -O2), I found that the first takes 5 cycles per iteration and the second takes 9 cycles per iteration. Why such a big difference?

It turns out to be an interplay between throughput and latency. The main killer is shrd, which takes 3 cycles and is on the critical path. Here's a picture of it (I've ignored the chain for i because it is faster and there is plenty of spare throughput for it, so it just runs ahead and doesn't interfere):

(Image: the dependency chain)

The edges here are dependencies, not data flow.

Based purely on the latencies in that chain, the expected time would be 8 cycles per iteration. But it isn't 8. The problem here is that, for 8 cycles to happen, mul2 and imul3 have to execute in parallel, and integer multiplication only has a throughput of 1/cycle. So one of them has to wait a cycle, which holds the chain back by a cycle. I verified this by changing that imul to an add, which reduced the time to 8 cycles per iteration. Changing the other imul to an add had no effect, as predicted by this explanation (it doesn't depend on shrd and can thus be scheduled earlier, without interfering with the other multiplications).

These exact details only apply to Haswell.

The code I used is this:

section .text

global cmp1
proc_frame cmp1
[endprolog]
    mov r8, rsi
    mov r9, rdi
    mov esi, 1
    xor edi, edi
    mov eax, 128
    xor edx, edx
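    ; n = rdx:rax (high:low), i = rdi:rsi; the loop computes n *= i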
.L2:
    mov rcx, rdx
    mov rdx, rdi
    imul    rdx, rax
    imul    rcx, rsi
    add rcx, rdx
    mul rsi
    add rdx, rcx
    add rsi, 1
    mov rcx, rsi
    adc rdi, 0
    xor rcx, 10000000
    or  rcx, rdi
    jne .L2
    mov rdi, r9
    mov rsi, r8
    ret
endproc_frame

global cmp2
proc_frame cmp2
[endprolog]
    mov r8, rsi
    mov r9, rdi
    mov esi, 1
    xor edi, edi
    mov eax, 128
    xor edx, edx
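    ; same 128-bit multiply as cmp1, plus the 2-bit right shift (shrd + sar)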
.L3:
    mov rcx, rdi
    imul    rcx, rax
    imul    rdx, rsi
    add rcx, rdx
    mul rsi
    add rdx, rcx
    shrd    rax, rdx, 2
    sar rdx, 2
    add rsi, 1
    mov rcx, rsi
    adc rdi, 0
    xor rcx, 10000000
    or  rcx, rdi
    jne .L3
    mov rdi, r9
    mov rsi, r8
    ret
endproc_frame