访问5字节结构比8字节慢得多

Mic*_*hal 5 .net c# arrays performance benchmarking

我有一个代码,它根据另一个小数组中的值更新数组.

  for (var i = 0; i < result.Length; i++)
  {
    var c = cards[i];
    result[i] -= one[c.C0] + one[c.C1];
  }
Run Code Online (Sandbox Code Playgroud)

其中c是一个结构是一对从一甲板表示卡的字节. one是一个52的数组大小(从甲板上的52张牌中的每一张都有条目)

我写了一个基准来分析这段代码:

private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
  for (var r = 0; r < testRepetitions; r++)
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
}
Run Code Online (Sandbox Code Playgroud)

设置testRepetitions= 2500万,并使用256个元素(result.Length = 256)的数组,它在我的机器上运行大约8.5秒.

这是Cards结构:

struct Cards
{
  public byte C0;
  public byte C1;

  public Cards(byte c0, byte c1)
  {
    C0 = c0;
    C1 = c1;
  }
}
Run Code Online (Sandbox Code Playgroud)

当我修改该结构以容纳5张卡(5个字节)时,相同的基准测试现在需要大约13秒.为什么会这样?计算是相同的,剩余的3个卡未使用,并且所有阵列都足够小以适合L1缓存.

更奇怪的是,如果我进一步改变卡片现在保持8个字节,基准测试现在更快,需要~10秒.

我的设置:

VS 2015 Update 3.
.NET 4.6.2
Release Build x64
CPU: Haswell i7-5820K CPU @ 3.30GHz
Run Code Online (Sandbox Code Playgroud)

这是我得到的确切时间:

Test With 2 Cards. Time = 8582 ms
Test With 5 Cards. Time = 12910 ms
Test With 8 Cards. Time = 10180 ms
Run Code Online (Sandbox Code Playgroud)

这里发生了什么?

基准代码:

class TestAdjustment
  {
    public void Test()
    {
      using (Process p = Process.GetCurrentProcess())
        p.PriorityClass = ProcessPriorityClass.High;

      var size = 256;

      float[] one = ArrayUtils.CreateRandomFloatArray(size:52);
      int[] card0 = ArrayUtils.RandomIntArray(size, minValue:0, maxValueInclusive:51);
      int[] card1 = ArrayUtils.RandomIntArray(size, minValue: 0, maxValueInclusive: 51);

      Cards[] cards = CreateCardsArray(card0, card1);
      Cards5[] cards5 = CreateCards5Array(card0, card1);
      Cards8[] cards8 = CreateCards8Array(card0, card1);

      float[] result = ArrayUtils.CreateRandomFloatArray(size);
      float[] resultClone = result.ToArray(); 


      var testRepetitions = 25*1000*1000;

      var sw = Stopwatch.StartNew();


      TestCards2(testRepetitions, result, one, cards);
      WriteLine($"Test With 2 Cards. Time = {sw.ElapsedMilliseconds} ms");
      result = resultClone.ToArray(); //restore original array from the clone, so that next method works on the same data
      sw.Restart();


      TestCards5(testRepetitions, result, one, cards5);
      WriteLine($"Test With 5 Cards. Time = {sw.ElapsedMilliseconds} ms");
      result = resultClone.ToArray();
      sw.Restart();


      TestCards8(testRepetitions, result, one, cards8);
      WriteLine($"Test With 8 Cards. Time = {sw.ElapsedMilliseconds} ms");


    }



    private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
    {
      for (var r = 0; r < testRepetitions; r++)
        for (var i = 0; i < result.Length; i++)
        {
          var c = cards[i];
          result[i] -= one[c.C0] + one[c.C1];
        }
    }

    private void TestCards5(int testRepetitions, float[] result, float[] one, Cards5[] cards)
    {
      for (var r = 0; r < testRepetitions; r++)
        for (var i = 0; i < result.Length; i++)
        {
          var c = cards[i];
          result[i] -= one[c.C0] + one[c.C1];
        }
    }


    private void TestCards8(int testRepetitions, float[] result, float[] one, Cards8[] cards)
    {
      for (var r = 0; r < testRepetitions; r++)
        for (var i = 0; i < result.Length; i++)
        {
          var c = cards[i];
          result[i] -= one[c.C0] + one[c.C1];
        }
    }


    private Cards[] CreateCardsArray(int[] c0, int[] c1)
    {
      var result = new Cards[c0.Length];
      for (var i = 0; i < result.Length; i++)
        result[i] = new Cards((byte)c0[i], (byte)c1[i]);

      return result;
    }

    private Cards5[] CreateCards5Array(int[] c0, int[] c1)
    {
      var result = new Cards5[c0.Length];
      for (var i = 0; i < result.Length; i++)
        result[i] = new Cards5((byte)c0[i], (byte)c1[i]);

      return result;
    }

    private Cards8[] CreateCards8Array(int[] c0, int[] c1)
    {
      var result = new Cards8[c0.Length];
      for (var i = 0; i < result.Length; i++)
        result[i] = new Cards8((byte)c0[i], (byte)c1[i]);

      return result;
    }
  }


  struct Cards
  {
    public byte C0;
    public byte C1;

    public Cards(byte c0, byte c1)
    {
      C0 = c0;
      C1 = c1;
    }
  }

  struct Cards5
  {
    public byte C0, C1, C2, C3, C4;

    public Cards5(byte c0, byte c1)
    {
      C0 = c0;
      C1 = c1;
      C2 = C3 = C4 = 0;
    }
  }

  struct Cards8
  {
    public byte C0, C1, C2, C3, C4, C5, C6, C7;


    public Cards8(byte c0, byte c1)
    {
      C0 = c0;
      C1 = c1;
      C2 = C3 = C4 = C5 = C6 = C7 = 0;
    }
  }
Run Code Online (Sandbox Code Playgroud)

编辑 我再次重新运行基准测试,进行了1亿次迭代.结果如下:

Test With 5 Cards. Time = 52245 ms
Test With 8 Cards. Time = 40531 ms
Run Code Online (Sandbox Code Playgroud)

并按相反的顺序:

Test With 8 Cards. Time = 41041 ms
Test With 5 Cards. Time = 52034 ms
Run Code Online (Sandbox Code Playgroud)

在Surface Pro 4上运行它(Skylake i7-6650U Turbo-boosted到~3.4ghz):

Test With 8 Cards. Time = 47913 ms
Test With 5 Cards. Time = 55182 ms
Run Code Online (Sandbox Code Playgroud)

因此,差异仍然存在,并且不依赖于订单.

我还使用英特尔VTune运行分析,它显示0.3"5卡"版本和0.27"8卡"的CPI .

Edit2添加了ArrayUtils类,用于创建初始随机数组.

 public static class ArrayUtils
  {
    static Random rand = new Random(137);

    public static float[] CreateRandomFloatArray(int size)
    {
      var result = new float[size];
      for (int i = 0; i < size; i++)
        result[i] = (float) rand.NextDouble();

      return result;
    }

    public static int[] RandomIntArray(int size, int minValue, int maxValueInclusive)
    {
      var result = new int[size];
      for (int i = 0; i < size; i++)
        result[i] = rand.Next(minValue, maxValueInclusive + 1);

      return result;

    }
  }
Run Code Online (Sandbox Code Playgroud)

And*_*hin 19

这都是关于未对齐的内存访问.未对齐的内存就绪延迟大于对齐的内存读取延迟.要完成实验,让我们添加结构Cards3,Cards4等等.让我们来看看相应的数组如何在内存中表示.

在此输入图像描述

接下来,让我们改进您的基准.

  1. 我们将使用BenchmarkDotNet(该工具将执行许多基准测试程序,如预热,自动选择迭代量,统计计算等).
  2. 我们将为所有Cards2.. Cards8数组做基准测试,而不仅仅是其中的3个.
  3. 此外,我们将检查所有JIT编译器的完整.NET Framework(LegacyJIT-x86,LegacyJIT-x64,RyuJIT-x64)和Mono.

这是我的环境:

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728068 ticks, Resolution=366.5598 ns, Timer=TSC
CLR1=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
CLR2=Mono JIT compiler version 4.4.0, Arch=32-bit
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0
Run Code Online (Sandbox Code Playgroud)

我的结果是:

 Method | Platform |       Jit | Toolchain | Runtime |    Median |    StdDev |
------- |--------- |---------- |---------- |-------- |---------- |---------- |
     C2 |     Host |      Host |      Mono |    Mono | 3.9230 ns | 0.0532 ns |
     C3 |     Host |      Host |      Mono |    Mono | 4.8223 ns | 0.0920 ns |
     C4 |     Host |      Host |      Mono |    Mono | 5.9149 ns | 0.1207 ns |
     C5 |     Host |      Host |      Mono |    Mono | 6.3981 ns | 0.0913 ns |
     C6 |     Host |      Host |      Mono |    Mono | 7.1179 ns | 0.1222 ns |
     C7 |     Host |      Host |      Mono |    Mono | 7.6318 ns | 0.1269 ns |
     C8 |     Host |      Host |      Mono |    Mono | 8.4650 ns | 0.1497 ns |
     C2 |      X64 | LegacyJit |      Host |    Host | 2.3515 ns | 0.0150 ns |
     C3 |      X64 | LegacyJit |      Host |    Host | 4.2553 ns | 0.0700 ns |
     C4 |      X64 | LegacyJit |      Host |    Host | 1.4366 ns | 0.0385 ns |
     C5 |      X64 | LegacyJit |      Host |    Host | 2.3688 ns | 0.0359 ns |
     C6 |      X64 | LegacyJit |      Host |    Host | 2.3684 ns | 0.0404 ns |
     C7 |      X64 | LegacyJit |      Host |    Host | 3.0404 ns | 0.0664 ns |
     C8 |      X64 | LegacyJit |      Host |    Host | 1.4510 ns | 0.0333 ns |
     C2 |      X64 |    RyuJit |      Host |    Host | 1.9281 ns | 0.0306 ns |
     C3 |      X64 |    RyuJit |      Host |    Host | 2.1183 ns | 0.0348 ns |
     C4 |      X64 |    RyuJit |      Host |    Host | 1.9395 ns | 0.0397 ns |
     C5 |      X64 |    RyuJit |      Host |    Host | 2.7706 ns | 0.0387 ns |
     C6 |      X64 |    RyuJit |      Host |    Host | 2.6471 ns | 0.0513 ns |
     C7 |      X64 |    RyuJit |      Host |    Host | 2.9743 ns | 0.0541 ns |
     C8 |      X64 |    RyuJit |      Host |    Host | 2.6280 ns | 0.1526 ns |
     C2 |      X86 | LegacyJit |      Host |    Host | 3.0854 ns | 0.2172 ns |
     C3 |      X86 | LegacyJit |      Host |    Host | 3.1627 ns | 0.1126 ns |
     C4 |      X86 | LegacyJit |      Host |    Host | 3.0577 ns | 0.0929 ns |
     C5 |      X86 | LegacyJit |      Host |    Host | 5.0957 ns | 0.1601 ns |
     C6 |      X86 | LegacyJit |      Host |    Host | 6.1723 ns | 0.1177 ns |
     C7 |      X86 | LegacyJit |      Host |    Host | 7.1155 ns | 0.0803 ns |
     C8 |      X86 | LegacyJit |      Host |    Host | 3.7703 ns | 0.1276 ns |
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

完整源代码:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Exporters;
using BenchmarkDotNet.Attributes.Jobs;
using BenchmarkDotNet.Running;

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
[RPlotExporter]
public class CardBenchmarks
{
  public const int Size = 256;

  private float[] result, one;
  private Cards2[] cards2;
  private Cards3[] cards3;
  private Cards4[] cards4;
  private Cards5[] cards5;
  private Cards6[] cards6;
  private Cards7[] cards7;
  private Cards8[] cards8;

  [Setup]
  public void Setup()
  {
    result = ArrayUtils.CreateRandomFloatArray(Size);
    one = ArrayUtils.CreateRandomFloatArray(size: 52);
    var c0 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);
    var c1 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);

    cards2 = CardUtls.Create2(c0, c1);
    cards3 = CardUtls.Create3(c0, c1);
    cards4 = CardUtls.Create4(c0, c1);
    cards5 = CardUtls.Create5(c0, c1);
    cards6 = CardUtls.Create6(c0, c1);
    cards7 = CardUtls.Create7(c0, c1);
    cards8 = CardUtls.Create8(c0, c1);
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C2()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards2[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C3()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards3[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C4()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards4[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C5()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards5[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C6()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards6[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C7()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards7[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C8()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards8[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }
}

public static class ArrayUtils
{
  private static readonly Random Rand = new Random(137);

  public static float[] CreateRandomFloatArray(int size)
  {
    var result = new float[size];
    for (int i = 0; i < size; i++)
      result[i] = (float) Rand.NextDouble();
    return result;
  }

  public static byte[] RandomByteArray(int size, int minValue, int maxValueInclusive)
  {
    var result = new byte[size];
    for (int i = 0; i < size; i++)
      result[i] = (byte) Rand.Next(minValue, maxValueInclusive + 1);
    return result;
  }
}

public static class CardUtls
{
  private static T[] Create<T>(int length, Func<int, T> create) => Enumerable.Range(0, length).Select(create).ToArray();

  public static Cards2[] Create2(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards2 {C0 = c0[i], C1 = c1[i]});
  public static Cards3[] Create3(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards3 {C0 = c0[i], C1 = c1[i]});
  public static Cards4[] Create4(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards4 {C0 = c0[i], C1 = c1[i]});
  public static Cards5[] Create5(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards5 {C0 = c0[i], C1 = c1[i]});
  public static Cards6[] Create6(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards6 {C0 = c0[i], C1 = c1[i]});
  public static Cards7[] Create7(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards7 {C0 = c0[i], C1 = c1[i]});
  public static Cards8[] Create8(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards8 {C0 = c0[i], C1 = c1[i]});
}

public struct Cards2
{
  public byte C0, C1;
}

public struct Cards3
{
  public byte C0, C1, C2;
}

public struct Cards4
{
  public byte C0, C1, C2, C3;
}

public struct Cards5
{
  public byte C0, C1, C2, C3, C4;
}

public struct Cards6
{
  public byte C0, C1, C2, C3, C4, C5;
}

public struct Cards7
{
  public byte C0, C1, C2, C3, C4, C5, C6;
}

public struct Cards8
{
  public byte C0, C1, C2, C3, C4, C5, C6, C7;
}


class Program
{
  static void Main()
  {
    BenchmarkRunner.Run<CardBenchmarks>();
  }
}
Run Code Online (Sandbox Code Playgroud)

回答

正如您所看到的,您的基准测试很难,有很多不同因素会影响您的性能.最重要的事情之一是运行时如何布局数据.例如,您不会在Mono上观察到所描述的行为,因为Mono和Full Framework具有不同的布局算法(在Mono上我们有Marshal.SizeOf(typeof(Card2)) == 8).

我们有Time(Card5) > Time(Card8)完整框架,因为Card5产生了许多不对齐的读取Card8(见第一张图片).

更新

评论中的问题:

知道为什么3字节在你的RyuJIT64基准测试中表现优于8字节吗?

我们来看看asm代码:

; RyuJIT-x64, C3
                var c = cards3[i];
00007FFEDF0CADCE  mov         r10,r9  
00007FFEDF0CADD1  mov         r11d,dword ptr [r10+8]  
00007FFEDF0CADD5  cmp         eax,r11d  
00007FFEDF0CADD8  jae         00007FFEDF0CAE44  
00007FFEDF0CADDA  movsxd      r11,eax  
00007FFEDF0CADDD  imul        r11,r11,3  
00007FFEDF0CADE1  lea         r10,[r10+r11+10h]  
00007FFEDF0CADE6  movzx       r11d,byte ptr [r10]          ; !!!
00007FFEDF0CADEA  movzx       r10d,byte ptr [r10+1]        ; !!!

; RyuJIT-x64, C8
                var c = cards8[i];
00007FFEDF0CAE8C  mov         rdx,qword ptr [rcx+48h]  
00007FFEDF0CAE90  mov         r8d,dword ptr [rdx+8]  
00007FFEDF0CAE94  cmp         eax,r8d  
00007FFEDF0CAE97  jae         00007FFEDF0CAF0C  
00007FFEDF0CAE99  movsxd      r8,eax  
00007FFEDF0CAE9C  mov         rdx,qword ptr [rdx+r8*8+10h] ; !!! 
00007FFEDF0CAEA1  mov         qword ptr [rsp+28h],rdx      ; !!!
Run Code Online (Sandbox Code Playgroud)

C3情况下,RyuJIT保持在目标字节r11d,r10d寄存器; 在这种C8情况下,RyuJIT将它们保持在堆栈(qword ptr [rsp+28h])上.解释:当前版本的RyuJIT(在我的例子中为v4.6.1080.0)对不超过4个字段的结构进行标量替换(参见coreclr#6839).因此,RyuJIT性能C5,C6,C7,和C8比的表现更差C2,C3,C4.注意:在RyuJIT的未来版本中可能会更改此行为.