tro*_*tro 44 .net c# parallel-extensions
I have a question about parallel loops. I have the following code:
public static void MultiplicateArray(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
}
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
for (int i = 0; i < arrayToChange.Length; i++)
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
}
}
Now I tried to add parallel functionality to it:
public static void MultiplicateArray(double[] array, double factor)
{
Parallel.For(0, array.Length, i =>
{
array[i] = array[i] * factor;
});
}
public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiplication[i];
});
}
public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
{
Parallel.For(0, arrayToChange.Length, i =>
{
arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
});
}
The problem is that I want to save time, not waste it. With the standard for loop the computation takes about 2 minutes, but with the parallel for loop it takes 3 minutes. Why?
svi*_*ick 60
Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code each iteration is very short (basically just a few CPU instructions), this overhead becomes prominent.
Because of this, I don't think using Parallel.For() is the right solution for you. Instead, if you parallelize the code manually (which is very simple in this case), you may well see the performance improve.
To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200,000,000 items (the code I used is below). On my machine the serial version consistently took 0.21 s and Parallel.For() usually took around 0.45 s, but from time to time it spiked up to 8–9 s!
First, I'll try to improve the common case and come back to those spikes later. We want to process the array with N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over every item in an array is one of the most optimized constructs there is. Can't we do something to help the compiler? Extracting the computation of the loop bound might help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This leads me to conclude that the CPU is not the bottleneck here, memory bandwidth is.
Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does. Why is that? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But if one thread finishes early, it will try to help process the range of another thread that hasn't finished yet. If that happens, you get a lot of false sharing, which can slow the code down considerably. My own test that forces false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of Parallel.For() seems to help with the spikes a little.
Of course, all of these measurements are specific to the hardware of my computer and will differ for you, so you should do your own measurements.
The code I used:
static void Main()
{
double[] array = new double[200 * 1000 * 1000];
for (int i = 0; i < array.Length; i++)
array[i] = 1;
for (int i = 0; i < 5; i++)
{
Stopwatch sw = Stopwatch.StartNew();
Serial(array, 2);
Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelFor(array, 2);
Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
ParallelForDegreeOfParallelism(array, 2);
Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallel(array, 2);
Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMax(array, 2);
Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelExtractedMaxHalfParallelism(array, 2);
Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);
sw = Stopwatch.StartNew();
CustomParallelFalseSharing(array, 2);
Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
}
}
static void Serial(double[] array, double factor)
{
for (int i = 0; i < array.Length; i++)
{
array[i] = array[i] * factor;
}
}
static void ParallelFor(double[] array, double factor)
{
Parallel.For(
0, array.Length, i => { array[i] = array[i] * factor; });
}
static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
Parallel.For(
0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
i => { array[i] = array[i] * factor; });
}
static void CustomParallel(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMax(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount / 2;
var tasks = new Task[degreeOfParallelism];
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
// capturing taskNumber in lambda wouldn't work correctly
int taskNumberCopy = taskNumber;
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
i < max;
i++)
{
array[i] = array[i] * factor;
}
});
}
Task.WaitAll(tasks);
}
static void CustomParallelFalseSharing(double[] array, double factor)
{
var degreeOfParallelism = Environment.ProcessorCount;
var tasks = new Task[degreeOfParallelism];
int i = -1;
for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
{
tasks[taskNumber] = Task.Factory.StartNew(
() =>
{
int j = Interlocked.Increment(ref i);
while (j < array.Length)
{
array[j] = array[j] * factor;
j = Interlocked.Increment(ref i);
}
});
}
Task.WaitAll(tasks);
}
Sample output:
Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Rom*_*ner 17
Svick has already provided a great answer, but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For(), but that you have to process larger chunks of data.
This can still be done with Parallel.For(), like this:
static void My(double[] array, double factor)
{
int degreeOfParallelism = Environment.ProcessorCount;
Parallel.For(0, degreeOfParallelism, workerId =>
{
var max = array.Length * (workerId + 1) / degreeOfParallelism;
for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
array[i] = array[i] * factor;
});
}
It does the same thing as svick's CustomParallelExtractedMax(), but it is shorter, simpler and (on my machine) even faster:
Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s
By the way, the keyword that is missing from all the other answers is granularity.
In a For loop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop is small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Create overloads to create an IEnumerable<T> of range partitions over the source elements. Then you can pass this collection of ranges to a ForEach method whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.
In your loop body you perform a single multiplication, so the overhead of the delegate call will be very noticeable.
Try this:
public static void MultiplicateArray(double[] array, double factor)
{
var rangePartitioner = Partitioner.Create(0, array.Length);
Parallel.ForEach(rangePartitioner, range =>
{
for (int i = range.Item1; i < range.Item2; i++)
{
array[i] = array[i] * factor;
}
});
}
See also: the Parallel.ForEach documentation and the Partitioner.Create documentation.
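If the chunk size that Partitioner.Create picks by default doesn't work well for your data, there is also an overload that takes an explicit range size. A minimal sketch, assuming a chunk of 100000 elements (both the value and the method name MultiplicateArrayChunked are purely illustrative):
public static void MultiplicateArrayChunked(double[] array, double factor)
{
    // Requires System.Collections.Concurrent and System.Threading.Tasks.
    // Each Tuple<int, int> range covers 100000 elements, so the delegate is
    // invoked once per chunk instead of once per element.
    var rangePartitioner = Partitioner.Create(0, array.Length, 100000);
    Parallel.ForEach(rangePartitioner, range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            array[i] = array[i] * factor;
        }
    });
}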
Parallel.For involves more complicated memory management. The results may vary depending on CPU specs such as the number of cores and the L1 and L2 caches...
Please take a look at this interesting article:
http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
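Since the results depend so much on the machine, it can help to record the basic hardware and runtime characteristics alongside each timing run. A minimal sketch, assuming processor count, process bitness and GC mode are the properties you care about (which ones to log is just a suggestion):
static void PrintEnvironmentInfo()
{
    // Requires the System and System.Runtime namespaces.
    Console.WriteLine("Logical processors: {0}", Environment.ProcessorCount);
    Console.WriteLine("64-bit process:     {0}", Environment.Is64BitProcess);
    Console.WriteLine("Server GC:          {0}", GCSettings.IsServerGC);
}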