为什么JIT订单会影响性能?

Edw*_*rey 34 .net c# compiler-construction performance jit

为什么.NET 4.0中的C#方法即时编译的顺序会影响它们执行的速度?例如,考虑两种等效方法:

public static void SingleLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        count += i % 16 == 0 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}

public static void MultiLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}
Run Code Online (Sandbox Code Playgroud)

唯一的区别是引入了一个局部变量,它影响生成的汇编代码和循环性能.为什么会出现这种情况本身就是一个问题.

可能更奇怪的是,在x86(但不是x64)上,调用方法的顺序对性能的影响大约为20%.调用这样的方法......

static void Main()
{
    SingleLineTest();
    MultiLineTest();
}
Run Code Online (Sandbox Code Playgroud)

......而且SingleLineTest速度更快.(使用x86 Release配置进行编译,确保启用"优化代码"设置,并从VS2010外部运行测试.)但是反转命令...

static void Main()
{
    MultiLineTest();
    SingleLineTest();
}
Run Code Online (Sandbox Code Playgroud)

......两种方法都需要相同的时间(几乎,但不是很长,只要MultiLineTest以前).(当运行此测试时,添加一些额外的调用SingleLineTest以及MultiLineTest获取其他样本是有用的.除了首先调用哪个方法之外,多少和什么顺序无关紧要.)

最后,为了证明JIT命令很重要,MultiLineTest请先离开,但要先强行 JIT SingleLineTest...

static void Main()
{
    RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);
    MultiLineTest();
    SingleLineTest();
}
Run Code Online (Sandbox Code Playgroud)

现在,SingleLineTest又快了.

如果在VS2010中关闭"抑制模块加载时的JIT优化",则可以放入断点SingleLineTest并查看循环中的汇编代码是否相同,而不管JIT顺序如何; 但是,方法开头的汇编代码会有所不同.但是,当大部分时间花在循环中时,这是多么重要,这是令人困惑的.

一个示例项目演示这种行为是在github.

目前尚不清楚这种行为如何影响实际应用程序.一个问题是它可以使性能调整变得易失,这取决于首先调用的顺序方法.使用分析器很难检测到这种问题.一旦找到热点并优化了算法,就很难在没有大量猜测的情况下知道并检查JITing方法是否可以提前进行额外的加速.

更新:另请参阅此问题的Microsoft Connect条目.

Ben*_*igt 25

请注意,我不相信"在模块加载时抑制JIT优化"选项,我在没有调试的情况下生成进程并在JIT运行后附加我的调试器.

在单行运行速度更快的版本中,这是Main:

        SingleLineTest();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000009  call        dword ptr ds:[00193818h] 
            SingleLineTest();
0000000f  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000015  call        dword ptr ds:[00193818h] 
            SingleLineTest();
0000001b  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000021  call        dword ptr ds:[00193818h] 
00000027  pop         ebp 
        }
00000028  ret 
Run Code Online (Sandbox Code Playgroud)

注意,它MultiLineTest已被放置在8字节边界上,并且SingleLineTest位于4字节边界上.

这是Main两个以相同速度运行的版本:

            MultiLineTest();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  call        dword ptr ds:[00153818h] 

            SingleLineTest();
00000009  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
0000000f  call        dword ptr ds:[00153818h] 
            SingleLineTest();
00000015  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
0000001b  call        dword ptr ds:[00153818h] 
            SingleLineTest();
00000021  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
00000027  call        dword ptr ds:[00153818h] 
0000002d  pop         ebp 
        }
0000002e  ret 
Run Code Online (Sandbox Code Playgroud)

令人惊讶的是,JIT选择的地址在最后4位数字中是相同的,即使据称它们以相反的顺序处理它们.不确定我是否相信.

需要更多挖掘.我认为有人提到循环之前的代码在两个版本中并不完全相同?去调查.

这是"慢"版本SingleLineTest(我检查过,函数地址的最后几位没有改变).

            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,7A5A2C68h 
0000000b  call        FFF91EA0 
00000010  mov         esi,eax 
00000012  mov         dword ptr [esi+4],0 
00000019  mov         dword ptr [esi+8],0 
00000020  mov         byte ptr [esi+14h],0 
00000024  mov         dword ptr [esi+0Ch],0 
0000002b  mov         dword ptr [esi+10h],0 
            stopwatch.Start();
00000032  cmp         byte ptr [esi+14h],0 
00000036  jne         00000047 
00000038  call        7A22B314 
0000003d  mov         dword ptr [esi+0Ch],eax 
00000040  mov         dword ptr [esi+10h],edx 
00000043  mov         byte ptr [esi+14h],1 
            int count = 0;
00000047  xor         edi,edi 
            for (uint i = 0; i < 1000000000; ++i) {
00000049  xor         edx,edx 
                count += i % 16 == 0 ? 1 : 0;
0000004b  mov         eax,edx 
0000004d  and         eax,0Fh 
00000050  test        eax,eax 
00000052  je          00000058 
00000054  xor         eax,eax 
00000056  jmp         0000005D 
00000058  mov         eax,1 
0000005d  add         edi,eax 
            for (uint i = 0; i < 1000000000; ++i) {
0000005f  inc         edx 
00000060  cmp         edx,3B9ACA00h 
00000066  jb          0000004B 
            }
            stopwatch.Stop();
00000068  mov         ecx,esi 
0000006a  call        7A23F2C0 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000006f  mov         ecx,797C29B4h 
00000074  call        FFF91EA0 
00000079  mov         ecx,eax 
0000007b  mov         dword ptr [ecx+4],edi 
0000007e  mov         ebx,ecx 
00000080  mov         ecx,797BA240h 
00000085  call        FFF91EA0 
0000008a  mov         edi,eax 
0000008c  mov         ecx,esi 
0000008e  call        7A23ABE8 
00000093  push        edx 
00000094  push        eax 
00000095  push        0 
00000097  push        2710h 
0000009c  call        783247EC 
000000a1  mov         dword ptr [edi+4],eax 
000000a4  mov         dword ptr [edi+8],edx 
000000a7  mov         esi,edi 
000000a9  call        793C6F40 
000000ae  push        ebx 
000000af  push        esi 
000000b0  mov         ecx,eax 
000000b2  mov         edx,dword ptr ds:[03392034h] 
000000b8  mov         eax,dword ptr [ecx] 
000000ba  mov         eax,dword ptr [eax+3Ch] 
000000bd  call        dword ptr [eax+1Ch] 
000000c0  pop         ebx 
        }
000000c1  pop         esi 
000000c2  pop         edi 
000000c3  pop         ebp 
000000c4  ret 
Run Code Online (Sandbox Code Playgroud)

而"快速"版本:

            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,7A5A2C68h 
0000000b  call        FFE11F70 
00000010  mov         esi,eax 
00000012  mov         ecx,esi 
00000014  call        7A1068BC 
            stopwatch.Start();
00000019  cmp         byte ptr [esi+14h],0 
0000001d  jne         0000002E 
0000001f  call        7A12B3E4 
00000024  mov         dword ptr [esi+0Ch],eax 
00000027  mov         dword ptr [esi+10h],edx 
0000002a  mov         byte ptr [esi+14h],1 
            int count = 0;
0000002e  xor         edi,edi 
            for (uint i = 0; i < 1000000000; ++i) {
00000030  xor         edx,edx 
                count += i % 16 == 0 ? 1 : 0;
00000032  mov         eax,edx 
00000034  and         eax,0Fh 
00000037  test        eax,eax 
00000039  je          0000003F 
0000003b  xor         eax,eax 
0000003d  jmp         00000044 
0000003f  mov         eax,1 
00000044  add         edi,eax 
            for (uint i = 0; i < 1000000000; ++i) {
00000046  inc         edx 
00000047  cmp         edx,3B9ACA00h 
0000004d  jb          00000032 
            }
            stopwatch.Stop();
0000004f  mov         ecx,esi 
00000051  call        7A13F390 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000056  mov         ecx,797C29B4h 
0000005b  call        FFE11F70 
00000060  mov         ecx,eax 
00000062  mov         dword ptr [ecx+4],edi 
00000065  mov         ebx,ecx 
00000067  mov         ecx,797BA240h 
0000006c  call        FFE11F70 
00000071  mov         edi,eax 
00000073  mov         ecx,esi 
00000075  call        7A13ACB8 
0000007a  push        edx 
0000007b  push        eax 
0000007c  push        0 
0000007e  push        2710h 
00000083  call        782248BC 
00000088  mov         dword ptr [edi+4],eax 
0000008b  mov         dword ptr [edi+8],edx 
0000008e  mov         esi,edi 
00000090  call        792C7010 
00000095  push        ebx 
00000096  push        esi 
00000097  mov         ecx,eax 
00000099  mov         edx,dword ptr ds:[03562030h] 
0000009f  mov         eax,dword ptr [ecx] 
000000a1  mov         eax,dword ptr [eax+3Ch] 
000000a4  call        dword ptr [eax+1Ch] 
000000a7  pop         ebx 
        }
000000a8  pop         esi 
000000a9  pop         edi 
000000aa  pop         ebp 
000000ab  ret 
Run Code Online (Sandbox Code Playgroud)

只是循环,快速左侧,慢速右侧:

00000030  xor         edx,edx                 00000049  xor         edx,edx 
00000032  mov         eax,edx                 0000004b  mov         eax,edx 
00000034  and         eax,0Fh                 0000004d  and         eax,0Fh 
00000037  test        eax,eax                 00000050  test        eax,eax 
00000039  je          0000003F                00000052  je          00000058 
0000003b  xor         eax,eax                 00000054  xor         eax,eax 
0000003d  jmp         00000044                00000056  jmp         0000005D 
0000003f  mov         eax,1                   00000058  mov         eax,1 
00000044  add         edi,eax                 0000005d  add         edi,eax 
00000046  inc         edx                     0000005f  inc         edx 
00000047  cmp         edx,3B9ACA00h           00000060  cmp         edx,3B9ACA00h 
0000004d  jb          00000032                00000066  jb          0000004B 
Run Code Online (Sandbox Code Playgroud)

指令是相同的(相对跳转,即使反汇编显示不同的地址,机器代码也相同),但对齐方式不同.有三次跳跃.在je加载一个恒定1在缓慢的版本,而不是在快速的版本一致,但它并不重要,因为这跳只拍摄的时间的1/16.另外两次跳转(jmp加载一个常数零,jb重复整个循环后)需要数百万次,并在"快速"版本中对齐.

我认为这是吸烟枪.