Is activating a struct without storing it in a local variable expected to be slower than storing it in a local variable?

Mik*_*EEE 7 c# performance .net-core benchmarkdotnet c#-7.2

I ran into a performance issue in .NET Core 2.1 that I would like to understand. The code can be found here:

https://github.com/mike-eee/StructureActivation

Here is the relevant BenchmarkDotNet benchmark code:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<Program>();
    }

    [Benchmark(Baseline = true)]
    public uint? Activated() => new Structure(100).SomeValue;

    [Benchmark]
    public uint? ActivatedAssignment()
    {
        var selection = new Structure(100);
        return selection.SomeValue;
    }
}

public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

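Going in, I expected Activated to be faster because it does not store a local variable. I have always understood a local variable to carry a performance cost, since space for it has to be located and reserved in the current stack frame.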

However, when I run the test I get the following results:

// * Summary *

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=2.1.402
  [Host]     : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT


              Method |     Mean |     Error |    StdDev | Scaled |
-------------------- |---------:|----------:|----------:|-------:|
           Activated | 4.700 ns | 0.0128 ns | 0.0107 ns |   1.00 |
 ActivatedAssignment | 3.331 ns | 0.0278 ns | 0.0260 ns |   0.71 |

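The activated struct (the version that does not store a local variable) is the slower one: ActivatedAssignment takes about 30% less time (Scaled 0.71).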

For reference, here is the IL, courtesy of ReSharper's IL Viewer:

.method /*06000002*/ public hidebysig instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> 
  Activated() cil managed 
{
  .custom /*0C00000C*/ instance void [BenchmarkDotNet/*23000002*/]BenchmarkDotNet.Attributes.BenchmarkAttribute/*0100000D*/::.ctor() 
    = (01 00 01 00 54 02 08 42 61 73 65 6c 69 6e 65 01 ) // ....T..Baseline.
    // property bool 'Baseline' = bool(true)
  .maxstack 1
  .locals /*11000001*/ init (
    [0] valuetype StructureActivation.Structure/*02000003*/ V_0
  )

  // [14 31 - 14 59]
  IL_0000: ldc.i4.s     100 // 0x64
  IL_0002: newobj       instance void valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>/*1B000001*/::.ctor(!0/*unsigned int32*/)/*0A00000F*/
  IL_0007: newobj       instance void StructureActivation.Structure/*02000003*/::.ctor(valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>)/*06000005*/
  IL_000c: stloc.0      // V_0
  IL_000d: ldloca.s     V_0
  IL_000f: call         instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> StructureActivation.Structure/*02000003*/::get_SomeValue()/*06000006*/
  IL_0014: ret          

} // end of method Program::Activated

.method /*06000003*/ public hidebysig instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> 
  ActivatedAssignment() cil managed 
{
  .custom /*0C00000D*/ instance void [BenchmarkDotNet/*23000002*/]BenchmarkDotNet.Attributes.BenchmarkAttribute/*0100000D*/::.ctor() 
    = (01 00 00 00 )
  .maxstack 2
  .locals /*11000001*/ init (
    [0] valuetype StructureActivation.Structure/*02000003*/ selection
  )

  // [19 4 - 19 39]
  IL_0000: ldloca.s     selection
  IL_0002: ldc.i4.s     100 // 0x64
  IL_0004: newobj       instance void valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>/*1B000001*/::.ctor(!0/*unsigned int32*/)/*0A00000F*/
  IL_0009: call         instance void StructureActivation.Structure/*02000003*/::.ctor(valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32>)/*06000005*/

  // [20 4 - 20 31]
  IL_000e: ldloca.s     selection
  IL_0010: call         instance valuetype [System.Runtime/*23000001*/]System.Nullable`1/*0100000E*/<unsigned int32> StructureActivation.Structure/*02000003*/::get_SomeValue()/*06000006*/
  IL_0015: ret          

} // end of method Program::ActivatedAssignment

On inspection, Activated has two newobj instructions while ActivatedAssignment has only one, which may contribute to the difference between the two benchmarks.

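My question is: is this expected? I am trying to understand why the benchmark with less code is actually slower than the one with more code. Any guidance or advice on following best practices would be greatly appreciated.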

sau*_*rol 5

If you look at the JITted assembly for the two methods, it becomes clearer what is happening:

Program.Activated()
L0000: sub rsp, 0x18
L0004: xor eax, eax              // Initialize Structure to {0}
L0006: mov [rsp+0x10], rax       // Store to stack
L000b: mov eax, 0x64             // Load literal 100
L0010: mov edx, 0x1              // Load literal 1
L0015: xor ecx, ecx              // Initialize SomeValue to {0}
L0017: mov [rsp+0x8], rcx        // Store to stack
L001c: lea rcx, [rsp+0x8]        // Load pointer to SomeValue from stack
L0021: mov [rcx], dl             // Set SomeValue.HasValue to 1
L0023: mov [rcx+0x4], eax        // Set SomeValue.Value to 100
L0026: mov rax, [rsp+0x8]        // Load SomeValue's value from stack
L002b: mov [rsp+0x10], rax       // Store it to a different location on stack
L0030: mov rax, [rsp+0x10]       // Return it from that location
L0035: add rsp, 0x18
L0039: ret

Program.ActivatedAssignment()
L0000: push rax
L0001: xor eax, eax              // Initialize SomeValue to {0}
L0003: mov [rsp], rax            // Store to stack
L0007: mov eax, 0x64             // Load literal 100
L000c: mov edx, 0x1              // Load literal 1
L0011: lea rcx, [rsp]            // Load pointer to SomeValue from stack
L0015: mov [rcx], dl             // Set SomeValue.HasValue to 1
L0017: mov [rcx+0x4], eax        // Set SomeValue.Value to 100
L001a: mov rax, [rsp]            // Return SomeValue
L001e: add rsp, 0x8
L0022: ret

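Clearly, Activated() is doing more work, which is why it is slower. It boils down to a lot of stack shuffling (all the references to rsp). I have commented the instructions as best I could, but Activated() is somewhat convoluted because of the redundant movs. ActivatedAssignment() is much simpler.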
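If you want to reproduce a listing like the one above from the benchmark itself, one option is BenchmarkDotNet's disassembly diagnoser. The following is only a sketch: it assumes the BenchmarkDotNet package is referenced, that the installed version accepts the attribute without arguments, and that the diagnoser supports your platform. The class name StructureBenchmarks is hypothetical and not part of the original project.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Same struct as in the question.
public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

// Asks BenchmarkDotNet to capture the JIT-generated assembly for each
// benchmark method in addition to the timing results.
[DisassemblyDiagnoser]
public class StructureBenchmarks // hypothetical name
{
    [Benchmark(Baseline = true)]
    public uint? Activated() => new Structure(100).SomeValue;

    [Benchmark]
    public uint? ActivatedAssignment()
    {
        var selection = new Structure(100);
        return selection.SomeValue;
    }
}

public static class Program
{
    // Run in Release configuration; BenchmarkDotNet refuses to benchmark
    // non-optimized builds by default.
    public static void Main() => BenchmarkRunner.Run<StructureBenchmarks>();
}
```

The exact instructions you get may differ from the listing above depending on the runtime and JIT version, so treat any comparison as specific to the environment in which it was produced.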

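Ultimately, you are not actually saving stack space by omitting the local variable. The variable has to exist whether or not you give it a name. The IL you pasted shows a local variable (named V_0), which is a temporary that the C# compiler created because you did not create one explicitly.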

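The difference between the two is that the version with the temp variable reserves only a single stack slot (.maxstack 1) and uses it for both the Nullable<T> and the Structure, hence the shuffling. The version with the named variable reserves 2 slots (.maxstack 2).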

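Ironically, in the version with the pre-reserved local variable for selection, the JIT is able to eliminate the outer struct and deal only with its embedded Nullable<T>, which makes for cleaner and faster code.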
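In source terms, the point can be sketched as follows. The three methods below return the same value. The third is only a hand-written illustration of what the JIT effectively reduces the named-variable version to according to the description above, not actual compiler or JIT output. The names TemporaryDemo and NullableOnly are hypothetical.

```csharp
using System;

// Same struct as in the question.
public readonly struct Structure
{
    public Structure(uint? someValue) => SomeValue = someValue;

    public uint? SomeValue { get; }
}

public static class TemporaryDemo // hypothetical name
{
    // No named local in the source; the C# compiler still emits a
    // temporary local (V_0 in the IL) to hold the struct.
    public static uint? Activated() => new Structure(100).SomeValue;

    // Named local; semantically equivalent to Activated().
    public static uint? ActivatedAssignment()
    {
        var selection = new Structure(100);
        return selection.SomeValue;
    }

    // Illustration only: the wrapper struct is gone and just the
    // Nullable<uint> remains.
    public static uint? NullableOnly()
    {
        uint? value = 100;
        return value;
    }

    public static void Main()
    {
        // All three forms produce the same result.
        Console.WriteLine(Activated() == ActivatedAssignment());
        Console.WriteLine(ActivatedAssignment() == NullableOnly());
    }
}
```

Because the forms are interchangeable at the source level, any timing difference between them comes from the generated IL and machine code rather than from the program's logic.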

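I'm not sure you can infer any best practices from this example, but I think it is easy enough to see that the C# compiler is the source of the performance difference. The JIT is smart enough to do the right thing with your struct, but only if the code is presented to it in a particular shape.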

  • Here is the link to the assembly I generated from your sample: https://sharplab.io/#v2:EYLgHgbALANALiAhgZwLYB8ACAGABJgRgG4BYAKEwGZ8AmfAgdnIG9zd36J8pcBZRAJYA7ABQBKNh1ZkOs3ADdEAJ1wAHXAF5cQgKYB3eg3GkZc9qoB0AQQDGcAYrg6AJsclnLt+45dXkyAQBzIVQdITg3U3YAX3J3fGoAV2E4AH5cLwdEJ1cxTQA+XBFdAwBlOCVEu0SlHRECbGwxMQtSgHtQgDVEABtEnRN4qlxk8PTMn2c/AODQ8PF46TN2RRVkHR6dOwE2oU1tfVxyyura+saxE2X8Blx1ze3d1o6dbr6B+NiyL/Jh2sRnLsegBPO4VKpwI7g046FhDajHCE1OqjNJ3F5vfp5DSFdpdXr9fbIDEEj5keEjFLpPGvUm4Zi4QI6OBEXBfaJAA=