Multiprocessing performance problems on Mono

Dou*_*las 8 c# mono performance multithreading multiprocessing

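I'm running into serious performance problems when running computationally intensive multiprocessing code on Mono. The simple snippet below, which estimates the value of pi using the Monte Carlo method, demonstrates the problem.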

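The program spawns as many threads as there are logical cores on the current machine and performs the same computation on each of them. When run on an Intel Core i7 laptop under Windows 7 with .NET Framework 4.5, the whole process completes in 4.2 s, and the relative standard deviation between the threads' individual execution times is 2%.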

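However, when run on the same machine (and operating system) using Mono 2.10.9, the total execution time rises to 18 s. There is huge variance between the individual threads: the fastest completes in just 5.6 s and the slowest takes 18 s. The mean is 14 s, and the relative standard deviation is 28%.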
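For anyone who wants to check the spread figures, here is a minimal sketch of the relative standard deviation calculation. The post doesn't say which variant was used; this assumes the population form (divide by N), which is what reproduces the quoted percentages. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq;

public static class ThreadTimingStats
{
    // Population relative standard deviation: standard deviation divided by the mean.
    // Assumption: the figures in the post were computed with the population form.
    public static double RelativeStdDev(double[] values)
    {
        double mean = values.Average();
        double variance = values.Select(v => (v - mean) * (v - mean)).Average();
        return Math.Sqrt(variance) / mean;
    }

    public static void Main()
    {
        // Per-thread durations in ms, copied from the Mono output in this post.
        double[] monoMs = { 10023, 13203, 14776, 15564, 17888, 16776, 16050, 5561 };

        Console.WriteLine("Mean (ms): " + monoMs.Average().ToString("F0"));
        Console.WriteLine("RSD:       " + (RelativeStdDev(monoMs) * 100).ToString("F1") + " %");
    }
}
```

With the Mono durations this comes out at roughly 28%, and with the .NET durations at roughly 2%, in line with the numbers quoted above.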

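The cause doesn't seem to be thread scheduling. Pinning each thread to a different core (by calling BeginThreadAffinity and SetThreadAffinityMask) has no significant effect on the threads' durations or on the variance between them.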
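The pinning code isn't included in the post, so the following is only a sketch of how it might look on Windows: Thread.BeginThreadAffinity plus a P/Invoke to the Win32 SetThreadAffinityMask function. The helper names MaskForCore and RunPinned are hypothetical, and the sketch leaves the affinity mask in place after the work finishes.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

public static class AffinityPinning
{
    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

    // Hypothetical helper: bit N of the affinity mask selects logical processor N.
    public static ulong MaskForCore(int coreIndex)
    {
        return 1UL << coreIndex;
    }

    // Hypothetical helper: pins the calling thread to one logical core, then runs the work.
    // Windows-only; the affinity mask is not restored afterwards.
    public static void RunPinned(int coreIndex, Action work)
    {
        // Tell the runtime that the managed thread depends on its OS thread identity.
        Thread.BeginThreadAffinity();
        try
        {
            UIntPtr previous = SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(MaskForCore(coreIndex)));
            if (previous == UIntPtr.Zero)
                throw new InvalidOperationException("SetThreadAffinityMask failed, error " + Marshal.GetLastWin32Error());

            work();
        }
        finally
        {
            Thread.EndThreadAffinity();
        }
    }

    public static void Main()
    {
        if (Environment.OSVersion.Platform != PlatformID.Win32NT)
        {
            Console.WriteLine("This sketch only applies to Windows.");
            return;
        }

        Thread t = new Thread(() => RunPinned(0, () => Console.WriteLine("Running pinned to core 0")));
        t.Start();
        t.Join();
    }
}
```

In the program from the question, ComputeMonteCarloPi could be wrapped this way, passing the thread index as coreIndex to spread the threads over the cores, or 0 for every thread to put them all on the same processor.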

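Similarly, running the computation several times on each thread (and timing each run individually) also gives seemingly arbitrary durations. So the problem doesn't seem to be caused by per-processor warm-up time either.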

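What I did find to make a difference was pinning all 8 threads to the same processor. In that case, the overall execution time was 25 s, which is only about 1% slower than executing 8× the work on a single thread. The relative standard deviation also dropped to under 1%. So the problem lies not in Mono's multithreading per se, but in its multiprocessing.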

Does anyone have a solution for this performance problem?

static long limit = 1L << 26;

static long[] results;
static TimeSpan[] timesTaken;

internal static void Main(string[] args)
{
    int processorCount = Environment.ProcessorCount;

    Console.WriteLine("Thread count: " + processorCount);
    Console.WriteLine("Number of points per thread: " + limit.ToString("N0"));

    Thread[] threads = new Thread[processorCount];            
    results = new long[processorCount];
    timesTaken = new TimeSpan[processorCount];

    for (int i = 0; i < processorCount; ++i)
        threads[i] = new Thread(ComputeMonteCarloPi);

    Stopwatch stopwatch = Stopwatch.StartNew();

    for (int i = 0; i < processorCount; ++i)
        threads[i].Start(i);

    for (int i = 0; i < processorCount; ++i)
        threads[i].Join();

    stopwatch.Stop();

    double average = results.Average();
    double ratio = average / limit;
    double pi = ratio * 4;

    Console.WriteLine("Pi: " + pi);

    Console.WriteLine("Overall duration:   " + FormatTime(stopwatch.Elapsed));
    Console.WriteLine();

    for (int i = 0; i < processorCount; ++i)
        Console.WriteLine("Thread " + i.ToString().PadLeft(2, '0') + " duration: " + FormatTime(timesTaken[i]));

    Console.ReadKey();
}

static void ComputeMonteCarloPi(object o)
{
    int processorID = (int)o;

    Random random = new Random(0);
    Stopwatch stopwatch = Stopwatch.StartNew();

    long hits = SamplePoints(random);

    stopwatch.Stop();

    timesTaken[processorID] = stopwatch.Elapsed;
    results[processorID] = hits;
}

private static long SamplePoints(Random random)
{
    long hits = 0;

    for (long i = 0; i < limit; ++i)
    {
        double x = random.NextDouble() - 0.5;
        double y = random.NextDouble() - 0.5;

        if (x * x + y * y <= 0.25)
            hits++;
    }

    return hits;
}

static string FormatTime(TimeSpan time, int padLeft = 7)
{
    return time.TotalMilliseconds.ToString("N0").PadLeft(padLeft);
}

Output on .NET:

Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14145541191101
Overall duration:     4,234

Thread 00 duration:   4,199
Thread 01 duration:   3,987
Thread 02 duration:   4,002
Thread 03 duration:   4,032
Thread 04 duration:   3,956
Thread 05 duration:   3,980
Thread 06 duration:   4,036
Thread 07 duration:   4,160

Output on Mono:

Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14139330387115
Overall duration:    17,890

Thread 00 duration:  10,023
Thread 01 duration:  13,203
Thread 02 duration:  14,776
Thread 03 duration:  15,564
Thread 04 duration:  17,888
Thread 05 duration:  16,776
Thread 06 duration:  16,050
Thread 07 duration:   5,561

Output on Mono, with all threads pinned to the same processor:

Thread count: 8
Number of points per thread: 67,108,864
Pi: 3.14139330387115
Overall duration:    25,260

Thread 00 duration:  24,704
Thread 01 duration:  25,191
Thread 02 duration:  24,689
Thread 03 duration:  24,697
Thread 04 duration:  24,716
Thread 05 duration:  24,725
Thread 06 duration:  24,707
Thread 07 duration:  24,720

Output on Mono, single-threaded:

Thread count: 1
Number of points per thread: 536,870,912
Pi: 3.14153660088778
Overall duration:    25,090

Rei*_*nds 5

Running with mono --gc=sgen works as expected and fixes it for me (using Mono 3.0.10).

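The underlying problem is that thread-local allocation in the Boehm garbage collector needs some special tuning when it is used together with typed allocation or large blocks. This is not only somewhat non-trivial, it also has drawbacks: either marking becomes more complicated and expensive, or you need one free list per thread and type (well, per memory layout).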

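As a result, by default the Boehm GC supports thread-local allocation only for memory areas that are either completely pointer-free or in which every word may be a pointer, and only up to about 256 bytes.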

Without thread-local allocation, however, every allocation acquires a global lock, and that lock becomes the bottleneck.
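To get a feel for why one lock shared by all threads hurts so much, here is a small sketch that compares threads taking the same lock for every unit of work against threads that keep their state local and publish it once. This is only an analogy: it does not touch the Boehm allocator, and every name in it is made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class LockContentionSketch
{
    // Every thread takes the same lock for each unit of work
    // (a stand-in for an allocator guarded by one global lock).
    public static long RunWithGlobalLock(int threadCount, int iterations)
    {
        long total = 0;
        object gate = new object();
        RunThreads(threadCount, () =>
        {
            for (int i = 0; i < iterations; ++i)
                lock (gate) { total++; }
        });
        return total;
    }

    // Every thread works on its own local state and publishes the result once
    // (a stand-in for thread-local allocation).
    public static long RunWithThreadLocalState(int threadCount, int iterations)
    {
        long total = 0;
        RunThreads(threadCount, () =>
        {
            long local = 0;
            for (int i = 0; i < iterations; ++i)
                local++;
            Interlocked.Add(ref total, local);
        });
        return total;
    }

    private static void RunThreads(int threadCount, ThreadStart body)
    {
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; ++i)
            threads[i] = new Thread(body);
        foreach (Thread t in threads) t.Start();
        foreach (Thread t in threads) t.Join();
    }

    public static void Main()
    {
        int threadCount = Environment.ProcessorCount;
        const int iterations = 5000000;

        Stopwatch sw = Stopwatch.StartNew();
        RunWithGlobalLock(threadCount, iterations);
        Console.WriteLine("Global lock:  " + sw.ElapsedMilliseconds + " ms");

        sw.Restart();
        RunWithThreadLocalState(threadCount, iterations);
        Console.WriteLine("Thread-local: " + sw.ElapsedMilliseconds + " ms");
    }
}
```

The absolute timings depend on the machine and runtime, and the JIT may simplify the thread-local loop considerably, so treat the numbers as a rough indication of contention rather than a benchmark.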

The SGen garbage collector was written specifically for Mono, is designed to be fast on multithreaded systems, and does not have these problems.