在C#中内联方法的成本

Question

在C#中内联方法的成本

Rai*_*ree 15 c# performance inline quicksort

我最近在C#中实现了QuickSort算法.对包含数百万个项目的整数数组进行排序,代码的性能比.NET的实现大约低10%.

private static void QS(int[] arr, int left, int right)
{
    if (left >= right) return;

    var pIndex = Partition(arr, left, right);
    QS( arr, left, pIndex);
    QS( arr, pIndex + 1, right);
}

Run Code Online (Sandbox Code Playgroud)

在包含500万个项目的数组中,此代码比.NET慢大约60ms.

随后,我创建了另一个方法,该Partition()方法具有内联方法QS()(消除方法调用和return语句).然而,这导致性能下降到比.NET的排序方法慢约250ms.

为什么会这样？

编辑:这是该Partition()方法的代码.在内联版本中QS(),除了return语句之外,此方法的全部内容都替换了该var pIndex = Partition(arr, left, right);行.

private static int Partition(int[] arr, int left, int right)
{
    int pivot = arr[left];
    int leftPoint = left - 1;
    int pIndex = right + 1;
    int temp = 0;

    while (true)
    {
        do { pIndex--; } while (arr[pIndex] > pivot);
        do { leftPoint++; } while (arr[leftPoint] < pivot);

        if (leftPoint < pIndex)
        {
            temp = arr[leftPoint];
            arr[leftPoint] = arr[pIndex];
            arr[pIndex] = temp;
        }
        else { break; }
    }
    return pIndex;
}

Run Code Online (Sandbox Code Playgroud)

编辑#2: 如果有人对编译感兴趣,这里是调用算法的代码:

编辑#3: 来自Haymo建议的新测试代码.

private static void Main(string[] args)
{
    const int globalRuns = 10;
    const int localRuns = 1000;

    var source = Enumerable.Range(1, 200000).OrderBy(n => Guid.NewGuid()).ToArray();
    var a = new int[source.Length];

    int start, end, total;

    for (int z = 0; z < globalRuns; z++)
    {
        Console.WriteLine("Run #{0}", z+1);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Array.Sort(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", ".NET", total, total / localRuns);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Quicksort.SortInline(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\t\tTtl: {1}ms\tAvg: {2}ms", "Inlined", total, total / localRuns);

        total = 0;
        for (int i = 0; i < localRuns; i++)
        {
            Array.Copy(source, a, source.Length);
            start = Environment.TickCount;
            Quicksort.SortNonInline(a);
            end = Environment.TickCount;
            total += end - start;
        }
        Console.WriteLine("{0}\tTtl: {1}ms\tAvg: {2}ms\n", "Not inlined", total, total / localRuns);
    }
}

Run Code Online (Sandbox Code Playgroud)

Answer 1

Hay*_*ach 13

根据问题中给出的信息,人们只能猜测和命名一些想法.

你测量得对吗？请记住,为了获得可靠的性能结果,应该(至少)

运行发布版本而不附加调试器
经常重复测试并平均结果
使测试公平,即.给所有测试对象提供相同的"资源配置"

为了确保,(假设的)性能下降的来源实际上与函数内联相关联,可以调查生成的IL代码.甚至更好:JIT编译器生成的机器指令.

对于ILNumerics,我们实施了自定义快速排序并执行了许多性能测量.最终算法比CLR版本快几倍.手动内联只是一项改进,这对于提高性能是必要的.其他人是:

不使用递归
使用不安全的比较/交换
对较小的分区使用插入排序
优化有限大小的临时(堆栈替换)阵列

通常情况下,会发现奇怪的性能结果的来源,算法会使用(错误)内存.另一个可能存在于不同的指令流中,最终或多或少地成功地被所涉及的任何编译器/处理器优化.整体执行性能是一个非常复杂的野兽,很难确定性地猜测,因此探查器是你最好的朋友!

@Edit:通过查看主要测试例程,您可能主要测量处理器/主存储器的内存带宽.长度为5*10e6的int []数组大小约为19 MB.这很可能超出了您的缓存范围.因此,由于大多数时候强制缓存未命中,处理器将等待内存.这使得衡量任何代码重构的影响难以猜测.我建议尝试以这种方式测量:(伪代码)

生成测试数据
为副本分配数组
迭代全局重复次数(比方说10)
- Array.Sort的内部重复(例如1000)
  - 复制(未排序的)测试数据
  - 通过Array.Sort 对副本进行排序
  - 增加时间
- Array.Sort的平均时间
- Quicksort.Sort的内部重复(例如1000)
  - 复制(未排序的)测试数据
  - 通过Quicksort.Sort 对副本进行排序
  - 增加时间
- Quicksort.Sort的平均时间
- Quicksort.Sort2的内部重复(例如1000)
  - 复制(未排序的)测试数据
  - 通过Quicksort.Sort2 对副本进行排序
  - 增加时间
- Quicksort.Sort2的平均时间

目标是使快速排序仅使用来自缓存的数据.因此,请确保不要从新内存重新创建副本,而是只有两个全局实例:原始实例和要排序的副本.两者必须同时适合您的(最后一级)缓存!有了一些空间(对于系统上的其他进程),一个很好的猜测是只使用两个阵列的可用最后一级高速缓存大小的一半.根据您的真实缓存大小,250k的测试长度似乎更合理.

@ Edit2:我运行了你的代码,得到了相同的结果,并在VS调试器中观察了(优化的)机器指令.以下是两个版本的相关部分:

Not inlined: 
    69:                 do { pIndex--; } while (arr[pIndex] > pivot);
00000017  dec         ebx 
00000018  cmp         ebx,esi 
0000001a  jae         00000053 
0000001c  cmp         dword ptr [ecx+ebx*4+8],edi 
00000020  jg          00000017 
    70:                 do { leftPoint++; } while (arr[leftPoint] < pivot);
00000022  inc         edx 
00000023  cmp         edx,esi 
00000025  jae         00000053 
00000027  cmp         dword ptr [ecx+edx*4+8],edi 
0000002b  jl          00000022 


Inlined: 
    97:                 do { pIndex--; } while (arr[pIndex] > pivot);
00000038  dec         dword ptr [ebp-14h] 
0000003b  mov         eax,dword ptr [ebp-14h] 
0000003e  cmp         eax,edi 
00000040  jae         00000097 
00000042  cmp         dword ptr [esi+eax*4+8],ebx 
00000046  jg          00000038 
    98:                 do { leftPoint++; } while (arr[leftPoint] < pivot);
00000048  inc         ecx 
00000049  cmp         ecx,edi 
0000004b  jae         00000097 
0000004d  cmp         dword ptr [esi+ecx*4+8],ebx 
00000051  jl          00000048

Run Code Online (Sandbox Code Playgroud)

可以看出,'Not inlined'版本更好地利用寄存器来递减运行索引(第69/97行).显然JIT决定不推送和弹出堆栈上的相应寄存器,因为同一函数中的其他代码正在使用相同的寄存器.由于这是一个热循环(并且CLR无法识别这一点),因此整体执行速度会受到影响.因此,在这种特定情况下,手动内联分区功能是无利可图的.

但是,如您所知,无法保证其他版本的CLR也会这样做.对于64位,甚至可能出现差异.

归档时间：	13 年，9 月前
查看次数：	631 次
最近记录：	9 年，4 月前