x64与x86有如此不同的性能结果的原因是什么?

410*_*one 15 .net c# performance compilation internal

我正在回答关于Code Review的问题,我发现x64和x86之间的性能(如很多)有一些有趣的差异.

class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<ModVsOptimization>();
        Console.ReadLine();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static public ulong Mersenne5(ulong dividend)
    {
        dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; } // mod 15
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        return dividend;
    }
}

public class ModVsOptimization
{
    [Benchmark(Baseline = true)]
    public ulong RawModulo_5()
    {
        ulong r = 0;
        for (ulong i = 0; i < 1000; i++)
        {
            r += i % 5;
        }
        return r;
    }

    [Benchmark]
    public ulong OptimizedModulo_ViaMethod_5()
    {
        ulong r = 0;
        for (ulong i = 0; i < 1000; i++)
        {
            r += Program.Mersenne5(i);
        }
        return r;
    }
}
Run Code Online (Sandbox Code Playgroud)

86:

// * Summary *

BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-5930K CPU 3.50GHz (Broadwell), ProcessorCount=12
Frequency=3415991 Hz, Resolution=292.7408 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2098.0
  DefaultJob : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2098.0


                      Method |     Mean |     Error |    StdDev | Scaled |
---------------------------- |---------:|----------:|----------:|-------:|
                 RawModulo_5 | 4.601 us | 0.0121 us | 0.0107 us |   1.00 |
 OptimizedModulo_ViaMethod_5 | 7.990 us | 0.0060 us | 0.0053 us |   1.74 |

// * Hints *
Outliers
  ModVsOptimization.RawModulo_5: Default                 -> 1 outlier  was  removed
  ModVsOptimization.OptimizedModulo_ViaMethod_5: Default -> 1 outlier  was  removed

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  Scaled : Mean(CurrentBenchmark) / Mean(BaselineBenchmark)
  1 us   : 1 Microsecond (0.000001 sec)

// ***** BenchmarkRunner: End *****
Run Code Online (Sandbox Code Playgroud)

64位:

// * Summary *

BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-5930K CPU 3.50GHz (Broadwell), ProcessorCount=12
Frequency=3415991 Hz, Resolution=292.7408 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2098.0
  DefaultJob : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2098.0


                      Method |     Mean |     Error |    StdDev | Scaled |
---------------------------- |---------:|----------:|----------:|-------:|
                 RawModulo_5 | 8.323 us | 0.0042 us | 0.0039 us |   1.00 |
 OptimizedModulo_ViaMethod_5 | 2.597 us | 0.0956 us | 0.0982 us |   0.31 |

// * Hints *
Outliers
  ModVsOptimization.OptimizedModulo_ViaMethod_5: Default -> 2 outliers were removed

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  Scaled : Mean(CurrentBenchmark) / Mean(BaselineBenchmark)
  1 us   : 1 Microsecond (0.000001 sec)

// ***** BenchmarkRunner: End *****
Run Code Online (Sandbox Code Playgroud)

现在这里是有趣的部分,这并不一定让我感到惊讶(由于我特别是C#编译器工作的方式),x86和x64程序集对于该RawModulo_5方法都具有相同的IL :

.method public hidebysig instance uint64 
        RawModulo_5() cil managed
{
  .custom instance void [BenchmarkDotNet.Core]BenchmarkDotNet.Attributes.BenchmarkAttribute::.ctor() = ( 01 00 01 00 54 02 08 42 61 73 65 6C 69 6E 65 01 ) // ....T..Baseline.
  // Code size       31 (0x1f)
  .maxstack  3
  .locals init ([0] uint64 r,
           [1] uint64 i)
  IL_0000:  ldc.i4.0
  IL_0001:  conv.i8
  IL_0002:  stloc.0
  IL_0003:  ldc.i4.0
  IL_0004:  conv.i8
  IL_0005:  stloc.1
  IL_0006:  br.s       IL_0014
  IL_0008:  ldloc.0
  IL_0009:  ldloc.1
  IL_000a:  ldc.i4.5
  IL_000b:  conv.i8
  IL_000c:  rem.un
  IL_000d:  add
  IL_000e:  stloc.0
  IL_000f:  ldloc.1
  IL_0010:  ldc.i4.1
  IL_0011:  conv.i8
  IL_0012:  add
  IL_0013:  stloc.1
  IL_0014:  ldloc.1
  IL_0015:  ldc.i4     0x3e8
  IL_001a:  conv.i8
  IL_001b:  blt.un.s   IL_0008
  IL_001d:  ldloc.0
  IL_001e:  ret
} // end of method ModVsOptimization::RawModulo_5
Run Code Online (Sandbox Code Playgroud)

现在我不知道接下来会在哪里看,但我怀疑问题出在JITter的某个地方,虽然我在RyuJIT和LegacyJIT上进行了测试,但两者在x64体系结构上都有相同的一般结果(尽管LegacyJIT 总体上略慢).这些是在Visual Studio 外部的发布模式下运行的,所以我假设没有附加的调试会话导致它.

所以我很好奇,是什么导致了这个?我不知道如何进一步调查,但如果有人对进一步的调查步骤有任何想法,请随时发表评论,我很乐意尝试执行它们.

oze*_*nix 14

我想对生成的汇编代码进行分析,看看发生了什么.我抓住你的示例代码并在发布模式下运行它.这是使用Visual Studio 2015与.NET Framework 4.5.2.CPU是Intel Ivy Bridge i5-3570K,以防JIT进行非常具体的优化.我运行相同的测试,但没有你的基准测试套件,只需使用一个简单的,Stopwatch并按时间划分迭代次数.这是我观察到的:

RawModulo_5, x86:                 13721978 ticks, 13.721978 ticks per iteration
OptimizedModulo_ViaMethod_5, x86: 24641039 ticks, 24.641039 ticks per iteration

RawModulo_5, x64:                 23275799 ticks, 23.275799 ticks per iteration
OptimizedModulo_ViaMethod_5, x64: 13389012 ticks, 13.389012 ticks per iteration
Run Code Online (Sandbox Code Playgroud)

这与您的测量结果略有不同 - 每种方法的性能或多或少都会根据x86与x64的不同而翻转.您的测量结果存在更明显的差异,特别是在每个实现与其他对应项之间.RawModulo_5x64中的速度略慢一倍,而x64 OptimizedModulo_ViaMethod_5则快3.7倍!

Also, I hope you're not expecting the outputs of RawModulo_5 and OptimizedModulo_ViaMethod_5 to be equal, because they are not! The correct Mersenne5 implementation is below:

static public ulong Mersenne5(ulong dividend)
{
    dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
    dividend = (dividend >> 16) + (dividend & 0xFFFF);
    dividend = (dividend >> 8) + (dividend & 0xFF);
    dividend = (dividend >> 4) + (dividend & 0xF);
    // there was an extra shift by 4 here
    if (dividend > 14) { dividend = dividend - 15; } // mod 15
    // the 9 used to be a 10
    if (dividend > 9) { dividend = dividend - 10; }
    if (dividend > 4) { dividend = dividend - 5; }
    return dividend;
}
Run Code Online (Sandbox Code Playgroud)

To gather the instructions on my system, I added a System.Diagnostics.Debugger.Break() within each method, just before the loops and the body of Mersenne5, so that I'd have a definite break point to grab the generated assembly. By the way, you can grab generated assembly code from the Visual Studio UI - if you're at a breakpoint you can right click the code editor window and select "Go To Disassembly" from the context menu. I've annotated the assembly to explain what it's doing. Sorry for the crazy syntax highlighting.

x86, RawModulo_5 method

            System.Diagnostics.Debugger.Break();
00242DA2  in          al,dx  
00242DA3  push        edi  
00242DA4  push        ebx  
00242DA5  sub         esp,10h  
00242DA8  call        6D4C0178  
            ulong r = 0;
00242DAD  mov         dword ptr [ebp-10h],0  ; setting the low and high dwords of 'r'
00242DB4  mov         dword ptr [ebp-0Ch],0  
            for (ulong i = 0; i < 1000; i++)
; set the high dword of 'i' to 0
00242DBB  mov         dword ptr [ebp-14h],0
; clear the low dword of 'i' to 0 - the compiler is using 'edi' as the loop iteration var
00242DC2  xor         edi,edi  
            {
                r += i % 5;
00242DC4  mov         eax,edi  
00242DC6  mov         edx,dword ptr [ebp-14h]  
; edx:eax together are the high and low dwords of 'i', respectively

; this is a short circuit trick so it can avoid working with the high
; dword - you can see it jumps halfway in to the div/mod operation below
00242DC9  mov         ecx,5  
00242DCE  cmp         edx,ecx  
00242DD0  jb          00242DDC  
; 64 bit div/mod operation
00242DD2  mov         ebx,eax  
00242DD4  mov         eax,edx  
00242DD6  xor         edx,edx  
00242DD8  div         eax,ecx  
00242DDA  mov         eax,ebx  
00242DDC  div         eax,ecx  
00242DDE  mov         eax,edx  
00242DE0  xor         edx,edx
; load the current low and high dwords from 'r', then add into
; edx:eax as a pair forming a qword
00242DE2  add         eax,dword ptr [ebp-10h]  
00242DE5  adc         edx,dword ptr [ebp-0Ch]  
; store the result back in 'r'
00242DE8  mov         dword ptr [ebp-10h],eax  
00242DEB  mov         dword ptr [ebp-0Ch],edx  
            for (ulong i = 0; i < 1000; i++)
; load the loop variable low and high dwords into edx:eax
00242DEE  mov         eax,edi  
00242DF0  mov         edx,dword ptr [ebp-14h]  
; increment eax (the low dword) and propagate any carries to
; edx (the high dword)
00242DF3  add         eax,1  
00242DF6  adc         edx,0  
; store the low and high dwords back to the high word of 'i' and
; the loop iteration counter, 'edi'
00242DF9  mov         dword ptr [ebp-14h],edx  
00242DFC  mov         edi,eax
; test the high dword  
00242DFE  cmp         dword ptr [ebp-14h],0  
00242E02  ja          00242E0E  
00242E04  jb          00242DC4  
; (int) i < 1000
00242E06  cmp         edi,3E8h  
00242E0C  jb          00242DC4  
            }
            return r;
; retrieve the current value of 'r' from memory, return value is
; in edx:eax since the return value is 64 bits
00242E0E  mov         eax,dword ptr [ebp-10h]  
00242E11  mov         edx,dword ptr [ebp-0Ch]  
00242E14  lea         esp,[ebp-8]  
00242E17  pop         ebx  
00242E18  pop         edi  
00242E19  pop         ebp  
00242E1A  ret  
Run Code Online (Sandbox Code Playgroud)

x86, OptimizedModulo_ViaMethod_5

            System.Diagnostics.Debugger.Break();
00242E33  push        edi  
00242E34  push        esi  
00242E35  push        ebx  
00242E36  sub         esp,8  
00242E39  call        6D4C0178  
            ulong r = 0;
; same as above, initialize 'r' to zero using low and high dwords
00242E3E  mov         dword ptr [ebp-10h],0  
; this time we're using edi:esi as the loop counter, rather than
; edi and a memory location. probably less register pressure in this
; function, for reasons we'll see...
00242E45  xor         ebx,ebx  
            for (ulong i = 0; i < 1000; i++)
; initialize 'i' to 0, esi is the loop counter low dword, edi is the high dword
00242E47  xor         esi,esi  
00242E49  xor         edi,edi  
; push 'i' to the stack, high word then low word
00242E4B  push        edi  
00242E4C  push        esi  
; call Mersenne5 - it got put in the data section since it's static
00242E4D  call        dword ptr ds:[3D7830h]  
; return value comes back as edx:eax, where edx is the high dword
; ebx is the existing low dword of 'r', so it's accumulated into eax
00242E53  add         eax,ebx  
; the high dword of 'r' is at ebp-10, that gets accumulated to edx with
; the carry result of the last add since it's 64 bits wide
00242E55  adc         edx,dword ptr [ebp-10h]
; store edx:ebx back to 'r'  
00242E58  mov         dword ptr [ebp-10h],edx  
00242E5B  mov         ebx,eax  
; increment the loop counter and carry to edi as well, 64 bit add
00242E5D  add         esi,1  
00242E60  adc         edi,0  
; make sure edi == 0 since it's the high dword
00242E63  test        edi,edi  
00242E65  ja          00242E71  
00242E67  jb          00242E4B  
; (int) i < 1000
00242E69  cmp         esi,3E8h  
00242E6F  jb          00242E4B  
            }
            return r;
; move 'r' to edx:eax to return them
00242E71  mov         eax,ebx  
00242E73  mov         edx,dword ptr [ebp-10h]  
00242E76  lea         esp,[ebp-0Ch]  
00242E79  pop         ebx  
00242E7A  pop         esi  
00242E7B  pop         edi  
00242E7C  pop         ebp  
00242E7D  ret  
Run Code Online (Sandbox Code Playgroud)

x86, Mersenne5() method

            System.Diagnostics.Debugger.Break();
00342E92  in          al,dx  
00342E93  push        edi  
00342E94  push        esi  
; esi is the low dword, edi is the high dword of the 64 bit argument
00342E95  mov         esi,dword ptr [ebp+8]  
00342E98  mov         edi,dword ptr [ebp+0Ch]  
00342E9B  call        6D4C0178  
            dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
; this is a LOT of instructions for each step, but at least it's all registers.

; copy edi:esi to edx:eax
00342EA0  mov         eax,esi  
00342EA2  mov         edx,edi
; clobber eax with edx, so now both are the high word. this is a
; shorthand for a 32 bit shift right of a 64 bit number.  
00342EA4  mov         eax,edx
; clear the high word now that we've moved the high word to the low word  
00342EA6  xor         edx,edx
; clear the high word of the original 'dividend', same as masking the low 32 bits  
00342EA8  xor         edi,edi  
; (dividend >> 32) + (dividend & 0xFFFFFFFF)
; it's a 64 bit add, so it's the usual add/adc
00342EAA  add         eax,esi  
00342EAC  adc         edx,edi
; 'dividend' now equals the temporary "variable" that held the addition result  
00342EAE  mov         esi,eax  
00342EB0  mov         edi,edx  
            dividend = (dividend >> 16) + (dividend & 0xFFFF);
; same idea as above, but with an actual shift and mask since it's not 32 bits wide
00342EB2  mov         eax,esi  
00342EB4  mov         edx,edi  
00342EB6  shrd        eax,edx,10h  
00342EBA  shr         edx,10h  
00342EBD  and         esi,0FFFFh  
00342EC3  xor         edi,edi  
00342EC5  add         eax,esi  
00342EC7  adc         edx,edi  
00342EC9  mov         esi,eax  
00342ECB  mov         edi,edx  
            dividend = (dividend >> 8) + (dividend & 0xFF);
; same idea, keep going down...
00342ECD  mov         eax,esi  
00342ECF  mov         edx,edi  
00342ED1  shrd        eax,edx,8  
00342ED5  shr         edx,8  
00342ED8  and         esi,0FFh  
00342EDE  xor         edi,edi  
00342EE0  add         eax,esi  
00342EE2  adc         edx,edi  
00342EE4  mov         esi,eax  
00342EE6  mov         edi,edx  
            dividend = (dividend >> 4) + (dividend & 0xF);
00342EE8  mov         eax,esi  
00342EEA  mov         edx,edi  
00342EEC  shrd        eax,edx,4  
00342EF0  shr         edx,4  
00342EF3  and         esi,0Fh  
00342EF6  xor         edi,edi  
00342EF8  add         eax,esi  
00342EFA  adc         edx,edi  
00342EFC  mov         esi,eax  
00342EFE  mov         edi,edx  
            dividend = (dividend >> 4) + (dividend & 0xF);
00342F00  mov         eax,esi  
00342F02  mov         edx,edi  
00342F04  shrd        eax,edx,4  
00342F08  shr         edx,4  
00342F0B  and         esi,0Fh  
00342F0E  xor         edi,edi  
00342F10  add         eax,esi  
00342F12  adc         edx,edi  
00342F14  mov         esi,eax  
00342F16  mov         edi,edx  
            if (dividend > 14) { dividend = dividend - 15; } // mod 15
; conditional subtraction
00342F18  test        edi,edi  
00342F1A  ja          00342F23  
00342F1C  jb          00342F29  
; 'dividend' > 14
00342F1E  cmp         esi,0Eh  
00342F21  jbe         00342F29  
; 'dividend' = 'dividend' - 15
00342F23  sub         esi,0Fh 
; subtraction borrow from high word 
00342F26  sbb         edi,0  
            if (dividend > 10) { dividend = dividend - 10; }
; same gist for the next two
00342F29  test        edi,edi  
00342F2B  ja          00342F34  
00342F2D  jb          00342F3A  
00342F2F  cmp         esi,0Ah  
00342F32  jbe         00342F3A  
00342F34  sub         esi,0Ah  
00342F37  sbb         edi,0  
            if (dividend > 4) { dividend = dividend - 5; }
00342F3A  test        edi,edi  
00342F3C  ja          00342F45  
00342F3E  jb          00342F4B  
00342F40  cmp         esi,4  
00342F43  jbe         00342F4B  
00342F45  sub         esi,5  
00342F48  sbb         edi,0  
            return dividend;
; move edi:esi into edx:eax for return
00342F4B  mov         eax,esi  
00342F4D  mov         edx,edi  
00342F4F  pop         esi  
00342F50  pop         edi  
00342F51  pop         ebp  
00342F52  ret         8  
Run Code Online (Sandbox Code Playgroud)

The first big thing I notice is that Mersenne5 is not actually getting inlined, even though it's listed tagged as AggressiveInlining. I'm guessing this is because inlining the function inside OptimizedModulo_ViaMethod_5 would cause horrific register spilling, and the large amount of memory reads and writes would completely destroy the point of inlining the method, so the compiler elected (quite wisely!) not to do so.

Second, Mersenne5 is getting call'd 1000 times by OptimizedModulo_ViaMethod_5, so there's 1000 pieces of extra call/ret overhead being experienced, including the necessary pushes and pops to save register states across the call boundary. RawModulo_5 doesn't make any calls outside, and even the 64 bit division is optimized a bit so it skips the high dword where it can.

x64, RawModulo_5 method

            System.Diagnostics.Debugger.Break();
000007FE98C93CF0  sub         rsp,28h  
000007FE98C93CF4  call        000007FEF7B079C0  
            ulong r = 0;
; the compiler knows the high dword of rcx is already 0, so it just
; zeros the low dword. this is 'r'
000007FE98C93CF9  xor         ecx,ecx  
            for (ulong i = 0; i < 1000; i++)
; same here, this is 'i'
000007FE98C93CFB  xor         r8d,r8d  
            {
                r += i % 5;
; load 5 as a dword to the low dword of r9
000007FE98C93CFE  mov         r9d,5  
; copy the loop counter to rax for the div below
000007FE98C93D04  mov         rax,r8  
; clear the lower dword of rdx, upper dword is clear already
000007FE98C93D07  xor         edx,edx  
; 64 bit div/mod in one instruction! but it's slow!
000007FE98C93D09  div         rax,r9  
; rax = quotient, rdx = remainder
; throw away the quotient since we're just doing mod, and accumulate the
; modulus into 'r'
000007FE98C93D0C  add         rcx,rdx  
            for (ulong i = 0; i < 1000; i++)
; 64 bit increment to the loop counter
000007FE98C93D0F  inc         r8  
; i < 1000
000007FE98C93D12  cmp         r8,3E8h  
000007FE98C93D19  jb          000007FE98C93CFE  
            }
            return r;
; return 'r' in rax, since we can directly return a 64 bit var in one register now
000007FE98C93D1B  mov         rax,rcx  
000007FE98C93D1E  add         rsp,28h  
000007FE98C93D22  ret  
Run Code Online (Sandbox Code Playgroud)

x64,OptimizedModulo_ViaMethod_5

            System.Diagnostics.Debugger.Break();
000007FE98C94040  push        rdi  
000007FE98C94041  push        rsi  
000007FE98C94042  sub         rsp,28h  
000007FE98C94046  call        000007FEF7B079C0  
            ulong r = 0;
; same general loop setup as above
000007FE98C9404B  xor         esi,esi  
            for (ulong i = 0; i < 1000; i++)
; 'edi' is the loop counter
000007FE98C9404D  xor         edi,edi  
; put rdi in rcx, which is the x64 register used for the first argument
; in a call
000007FE98C9404F  mov         rcx,rdi  
; call Mersenne5 - still no actual inlining!
000007FE98C94052  call        000007FE98C90F40  
; accumulate 'r' with the return value of Mersenne5
000007FE98C94057  add         rax,rsi  
; store back to 'r' - I don't know why in the world the compiler did this
; seems like add rsi, rax would be better, but maybe there's a pipelining
; issue I'm not seeing.
000007FE98C9405A  mov         rsi,rax  
; increment loop counter
000007FE98C9405D  inc         rdi  
; i < 1000
000007FE98C94060  cmp         rdi,3E8h  
000007FE98C94067  jb          000007FE98C9404F  
            }
            return r;
; put return value in rax like before
000007FE98C94069  mov         rax,rsi  
000007FE98C9406C  add         rsp,28h  
000007FE98C94070  pop         rsi  
000007FE98C94071  pop         rdi  
000007FE98C94072  ret  
Run Code Online (Sandbox Code Playgroud)

x64,Mersenne5方法

            System.Diagnostics.Debugger.Break();
000007FE98C94580  push        rsi  
000007FE98C94581  sub         rsp,20h  
000007FE98C94585  mov         rsi,rcx  
000007FE98C94588  call        000007FEF7B079C0  
            dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
; pretty similar to before actually, except this time we do a real
; shift and mask for the 32 bit part
000007FE98C9458D  mov         rax,rsi  
; 'dividend' >> 32
000007FE98C94590  shr         rax,20h  
; hilariously, we have to load the mask into edx first. this is because
; there is no AND r/64, imm64 in x64
000007FE98C94594  mov         edx,0FFFFFFFFh  
000007FE98C94599  and         rsi,rdx  
; add the shift and the masked versions together
000007FE98C9459C  add         rax,rsi  
000007FE98C9459F  mov         rsi,rax  
            dividend = (dividend >> 16) + (dividend & 0xFFFF);
; same logic continues down
000007FE98C945A2  mov         rax,rsi  
000007FE98C945A5  shr         rax,10h  
000007FE98C945A9  mov         rdx,rsi  
000007FE98C945AC  and         rdx,0FFFFh  
000007FE98C945B3  add         rax,rdx  

; note the redundant moves that happen every time, rax into rsi, rsi
; into rax. so there's still not ideal x64 being generated.
000007FE98C945B6  mov         rsi,rax  
            dividend = (dividend >> 8) + (dividend & 0xFF);
000007FE98C945B9  mov         rax,rsi  
000007FE98C945BC  shr         rax,8  
000007FE98C945C0  mov         rdx,rsi  
000007FE98C945C3  and         rdx,0FFh  
000007FE98C945CA  add         rax,rdx  
000007FE98C945CD  mov         rsi,rax  
            dividend = (dividend >> 4) + (dividend & 0xF);
000007FE98C945D0  mov         rax,rsi  
000007FE98C945D3  shr         rax,4  
000007FE98C945D7  mov         rdx,rsi  
000007FE98C945DA  and         rdx,0Fh  
000007FE98C945DE  add         rax,rdx  
000007FE98C945E1  mov         rsi,rax  
            dividend = (dividend >> 4) + (dividend & 0xF);
000007FE98C945E4  mov         rax,rsi  
000007FE98C945E7  shr         rax,4  
000007FE98C945EB  mov         rdx,rsi  
000007FE98C945EE  and         rdx,0Fh  
000007FE98C945F2  add         rax,rdx  
000007FE98C945F5  mov         rsi,rax  
            if (dividend > 14) { dividend = dividend - 15; } // mod 15
; notice the difference in jumping logic - the pairs of jumps are now singles
000007FE98C945F8  cmp         rsi,0Eh  
000007FE98C945FC  jbe         000007FE98C94602 
; using a single 64 bit add instead of a subtract, the immediate constant
; is the 2's complement of 15. this is okay because there's no borrowing
; to do since we can do the entire sub in one operation to one register. 
000007FE98C945FE  add         rsi,0FFFFFFFFFFFFFFF1h  
            if (dividend > 10) { dividend = dividend - 10; }
000007FE98C94602  cmp         rsi,0Ah  
000007FE98C94606  jbe         000007FE98C9460C  
000007FE98C94608  add         rsi,0FFFFFFFFFFFFFFF6h  
            if (dividend > 4) { dividend = dividend - 5; }
000007FE98C9460C  cmp         rsi,4  
000007FE98C94610  jbe         000007FE98C94616  
000007FE98C94612  add         rsi,0FFFFFFFFFFFFFFFBh  
            return dividend;
000007FE98C94616  mov         rax,rsi  
000007FE98C94619  add         rsp,20h  
000007FE98C9461D  pop         rsi  
000007FE98C9461E  ret  
Run Code Online (Sandbox Code Playgroud)


所有x64方法看起来都比x86方法好,但仍然存在这样的问题:RawModulo_5与x86相比,为什么x64的速度是x64的两倍,尤其是为什么OptimizedModulo_ViaMethod_5x64下x86的速度几乎快4倍.为了得到完整的解释,我认为我们需要像Peter Cordes这样的人 - 他在指导时间和流水线方面比我更有知识.以下是我对优势和劣势来自何处的直觉.

  1. divx86中的[x64 con]与x64相关RawModulo_5

    According to the instruction tables provided by Agner Fog here, on Broadwell a 32 bit div takes 10 micro-ops and has a latency of 22 to 29 clocks, while 64 bit div takes 36 micro-ops and has a latency of 32 to 95 clocks.

    编译器还在x86 RawModulo_5中进行了优化,div在每种情况下绕过高dword ,因为循环保持在下面int.MaxValue,所以实际上它只是div在每次迭代时执行一个32位.因此,64位div延迟比32位div延迟高1.45到3.27倍.两个版本都对结果有完全依赖性div,因此x64代码由于更高的延迟而支付更大的性能损失.我敢说,在x86 RawModulo_5中添加64位的add/adc指令对于64位宽的巨大性能劣势是一个很小的惩罚div.

  2. [x64 pro]减少x64中的呼叫开销OptimizedModulo_ViaMethod_5

    This is probably not a huge difference in terms of performance, but it's worth mentioning. Because OptimizedModulo_ViaMethod_5 is calling Mersenne5 1000 times in both versions, the 64 bit version is paying far less a penalty in terms of the standard x86 versus x64 calling convention. Consider that the x86 version has to push two registers to the stack to pass a 64 bit variable, then Mersenne5 has to preserve esi and edi, then pull the high and low dwords out of the stack for edx and eax respectively. At the end, Mersenne5 has to restore esi and edi. In the x64 version, the value of i is passed in ecx directly, so no memory access is involved at all. The x64 Mersenne5 only saves and restores rsi, the other registers are clobbered.

  3. [x64 pro] Many fewer instructions in x64 Mersenne5

    Mersenne5 is more efficient in x64 as it can perform all the operations on the 64 bit dividend in single instructions, versus requiring pairs of instructions in x86 for the mov and add/adc operations. I have a hunch that the dependency chains are better in x64 as well, but I am not knowledgeable enough to speak on that subject.

  4. [x64 pro] Better jump behavior in x64 Mersenne5

    The three conditional subtractions that Mersenne5 does at the end are implemented much better under x64 than x86. On x86, each one has two comparisons and three possible conditional jumps that can be taken. On x64, there is only one comparison and one conditional jump, which is undoubtedly more efficient.


With those points in mind, it makes some sense for Ivy Bridge we'd see the performance of each flip-flop from x86 to x64. It's likely that the 64 bit division latency penalty (which is a little worse on Ivy Bridge than Broadwell, but not much) is hurting RawModulo_5 quite a bit, and the near halving of instructions in Mersenne5 is speeding up OptimizedModulo_ViaMethod_5 at the same time.

What doesn't make sense is the results on Broadwell - I'm still a little surprised how much faster the x64 OptimizedModulo_ViaMethod_5 is, even compared to the x86 RawModulo_5. I imagine the answer would be micro-op fusion and pipelining for the Mersenne5 method is considerably better on x64, or perhaps the JIT on your architecture is using Broadwell-specific knowledge to output very different instructions.

I'm sorry I can't give a more conclusive answer, but I hope the analysis above is enlightening as to why there's a difference between the two methods and the two architectures.

By the way, if you want to see what a truly inlined version can do, here you go:

RawModulo_5, x86:                  13722506 ticks, 13.722506 ticks per iteration
OptimizedModulo_ViaMethod_5, x86:  23640994 ticks, 23.640994 ticks per iteration
OptimizedModulo_TrueInlined, x86:  21488012 ticks, 21.488012 ticks per iteration
OptimizedModulo_TrueInlined2, x86: 21645697 ticks, 21.645697 ticks per iteration

RawModulo_5, x64:                 22175326 ticks, 22.175326 ticks per iteration
OptimizedModulo_ViaMethod_5, x64: 12822574 ticks, 12.822574 ticks per iteration
OptimizedModulo_TrueInlined, x64:  7612328 ticks,  7.612328 ticks per iteration
OptimizedModulo_TrueInlined2, x64: 7591190 ticks,  7.59119 ticks per iteration
Run Code Online (Sandbox Code Playgroud)

And the code:

public ulong OptimizedModulo_TrueInlined()
{
    ulong r = 0;
    ulong dividend = 0;

    for (ulong i = 0; i < 1000; i++)
    {
        dividend = i;
        dividend = (dividend >> 32) + (dividend & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; } // mod 15
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        r += dividend;
    }
    return r;
}

public ulong OptimizedModulo_TrueInlined2()
{
    ulong r = 0;
    ulong dividend = 0;

    for (ulong i = 0; i < 1000; i++)
    {
        dividend = (i >> 32) + (i & 0xFFFFFFFF);
        dividend = (dividend >> 16) + (dividend & 0xFFFF);
        dividend = (dividend >> 8) + (dividend & 0xFF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        dividend = (dividend >> 4) + (dividend & 0xF);
        if (dividend > 14) { dividend = dividend - 15; } // mod 15
        if (dividend > 10) { dividend = dividend - 10; }
        if (dividend > 4) { dividend = dividend - 5; }
        r += dividend;
    }
    return r;
}
Run Code Online (Sandbox Code Playgroud)

  • @EBrown答案没有提到这一点,但你*需要*以确保取消选中Tools-> Options-> Debugging-> General-> Suppress JIT对模块加载优化的设置,如果你想看到优化的调试器下的发布模式下的代码. (2认同)

Han*_*ant 13

  r += i % 5;
Run Code Online (Sandbox Code Playgroud)

这是代码片段中的瓶颈声明,正如@ozeanix所解释的那样.我会诠释他的广泛答案.

除法是处理器必须执行的硬操作之一,没有已知的数字电路可以在单个周期内执行除法.它必须采用迭代方法实施,与您在小学中学习的方式没有根本的不同.执行时间与位数成正比,64位除法可以预期为32位除法的两倍.

x86抖动,不得不生成繁琐的代码,仅使用32位寄存器进行数学计算,它采用了一个快捷方式,其中高32位为ulong0.这在特定情况下表现良好,999和5足够小.请注意Mersenne5()方法上64位代码的速度要快得多,能够使用单个寄存器来存储中间值,并且单个移位指令一次移动64位会使它大大增加.

x64抖动不能使用x86抖动使用的相同技巧,而不是使代码更慢,64位寄存器的高32位不能直接寻址.这并不意味着你坚持使用较慢的性能,并且可以充分信任任何猪可以飞行.我将展示一个我从C编译器优化器反向设计的编码技巧.它适用于这种特定情况,因为您反复使用相同的除数.为了说明这个技巧,这是这样的编译器在其内部循环中生成的机器代码,其中循环展开和指令混合被删除:

00007FF603121006  mov         rax,0CCCCCCCCCCCCCCCDh    ; magic!
00007FF603121010  mul         rax,r9                    ; magic * i
00007FF603121013  shr         rdx,2                     ; rdx = (magic * i) / 4 / 2^64 
00007FF603121017  lea         rcx,[rdx+rdx*4]           ; 5 * rdx
00007FF60312101B  mov         rdx,r9                    ; i
00007FF60312101E  sub         rdx,rcx                   ; i - 5 * ((magic * i) / 4 / 2^64)
00007FF603121024  add         r8,rdx                    ; r += i % 5
Run Code Online (Sandbox Code Playgroud)

这是,咳嗽,难以理解.关键是代码根本不使用DIV指令,但可以使用SHR执行,这使得它非常快.SHR与>>C#中的运算符完全等价,右移相当于除以2的幂.

最大的诀窍是将除法变换为5除以2的幂.这通常不可能,但它可以近似.这需要一些重写技巧.它以将模数转换为除法的身份开始:

A % B == A - B * (A / B)
Run Code Online (Sandbox Code Playgroud)

通过将左侧和右侧乘以N/B来转换除法,其中N是2的方便幂:

A % B == A - B * ((A * N / B) / N)
Run Code Online (Sandbox Code Playgroud)

由于N/B在前面是已知的,因此它可以从循环中提升.我应该强调,这个身份只适用于浮点除法.我们想要使用整数除法.从而:

A % B ~= A - B * (A * K / N)   where K ~= N / B
Run Code Online (Sandbox Code Playgroud)

我们为N选择的值越大,K的近似就越准确.C编译器代码对N,4*2 ^ 64 使用非常大的值,利用64位乘法产生128位结果.我们不能在C#中做的事情,我们必须为N选择一个足够小的值,以便结果永远不会溢出.在辅助类中编码此方法:

public class FastModulo {
    public FastModulo(ulong maxdividend, ulong divisor) {
        div = divisor;
        int dividendbits = 1 + (int)(Math.Log(maxdividend - 1) / Math.Log(2));
        shift = 64 - dividendbits;
        mult = (ulong)Math.Round((double)(1UL << shift) / divisor);
        //TODO: verify that the approximation is accurate enough.
    }
    public ulong Modulo(ulong value) {
        return value - (div * ((value * mult) >> shift));
    }
    int shift;
    ulong mult, div;
}
Run Code Online (Sandbox Code Playgroud)

并使用它:

public ulong RawModulo_5() {
    var fm = new FastModulo(1000, 5);
    ulong r = 0;
    for (uint i = 0; i < 1000; i++) {
        r += fm.Modulo(i);
    }
}
Run Code Online (Sandbox Code Playgroud)

或者可读性较差:

        r += i - (5 * ((i * 3602879701896397UL) >> 54));
Run Code Online (Sandbox Code Playgroud)

它在64位模式下要快得多(不要在32中使用),我看到我的手机Haswell有一个粗略的x3改进.通过将2个倍数,一个移位和一个减法代替昂贵的多周期除法来实现.每个只需1个周期.

有一个// TODO,它需要检查以验证当被除数或除数变得太大时近似值不会导致错误.不是100%肯定如何正确地做到这一点,模块化数学让我头疼.但我敢肯定大多数程序员认为这是一个好奇而不是实用的代码:)如果有人想挖掘然后请编辑代码添加检查,否则只需运行代码两种方式来验证结果是否相同.