Pep*_*Pep · score 18 · tags: delphi, parallel-processing, rtl-ppl
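I am trying out the parallel programming features of Delphi XE7 Update 1.
I created a simple TParallel.For loop that basically does some bogus operations to pass the time.
I launched the program on a 36 vCPU AWS instance (c4.8xlarge) to try to see what the benefit of parallel programming is.
When I first launch the program and execute the TParallel.For loop, I see a significant gain (although, admittedly, a lot less than I expected with 36 vCPUs):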
Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms
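To put numbers on "a lot less than I expected": speedup S is the serial time divided by the parallel time, and parallel efficiency E divides that speedup by the number of vCPUs, p = 36. Treating every vCPU as a full core is an assumption (vCPUs may be hyperthreads rather than physical cores), so E computed against 36 is a lower bound on how well the physical cores are being used. The times are the ones from the first run above.

```latex
S_1 = \frac{T_\text{serial}}{T_\text{parallel}} = \frac{2314\,\text{ms}}{242\,\text{ms}} \approx 9.56,
\qquad
E_1 = \frac{S_1}{p} = \frac{9.56}{36} \approx 0.27
```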
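If I don't close the program and run the pass again on the 36 vCPU machine shortly afterwards (e.g. immediately, or some 10-20 seconds later), the parallel pass gets much worse: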
Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms
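If I don't close the program and instead wait a few minutes (not a few seconds, but a few minutes) before running the pass again, I once more get the results I got when I first launched the program (a 10x improvement in response time).
The first pass after launching the program on the 36 vCPU machine is always faster, so it seems this effect only occurs when TParallel.For is called a second time in the program.
This is the sample code I'm running: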
unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;
begin
  sw := TStopWatch.Create;
  maxItems := 5000;
  Randomize;
  SetLength(referenceStr,120000);
  for i := 1 to 120000 do
    referenceStr[i] := Chr(Ord('a') + Random(26));
  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
      var
        index: integer;
        found: integer;
      begin
        found := 0;
        for index := 1 to length(referenceStr) do begin
          if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
            inc(found);
          end;
        end;
        TInterlocked.Add(matches, found);
      end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.
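The symptom (only the first pass after start-up is fast, and waiting minutes restores it) can be narrowed down by recording, for each pass, how many distinct OS threads actually ran iterations, and by varying the pause before each pass. The console program below is a diagnostic sketch that is not part of the original question: `ProbePass` and the `Pauses` schedule are hypothetical, the workload is the letter-counting loop from the code above, and the calls it makes (`TParallel.For`, `TDictionary`, `TCriticalSection`, `GetCurrentThreadId`, `TStopwatch`) are standard RTL/Winapi ones. It has not been run on the AWS instances described here.

```pascal
program PoolProbe;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils, System.SyncObjs, System.Threading,
  System.Diagnostics, System.Generics.Collections;

// Runs one TParallel.For pass with the question's workload. Returns the
// number of distinct OS threads that ran at least one iteration; the match
// total and elapsed time are reported through out parameters.
function ProbePass(const RefStr: string; MaxItems: Integer;
  out Matches: Integer; out ElapsedMs: Int64): Integer;
var
  Ref: string;                          // local copy so the closure can capture it
  Total: Integer;
  Seen: TDictionary<Cardinal, Boolean>; // thread IDs observed in this pass
  Lock: TCriticalSection;
  Sw: TStopwatch;
begin
  Ref := RefStr;
  Total := 0;
  Seen := TDictionary<Cardinal, Boolean>.Create;
  Lock := TCriticalSection.Create;
  try
    Sw := TStopwatch.StartNew;
    TParallel.For(1, MaxItems,
      procedure(Value: Integer)
      var
        k, Found: Integer;
      begin
        Found := 0;
        for k := 1 to Length(Ref) do
          if (Value mod 26) + Ord('a') = Ord(Ref[k]) then
            Inc(Found);
        AtomicIncrement(Total, Found);
        // Record which thread executed this iteration.
        Lock.Enter;
        try
          Seen.AddOrSetValue(GetCurrentThreadId, True);
        finally
          Lock.Leave;
        end;
      end);
    ElapsedMs := Sw.ElapsedMilliseconds;
    Matches := Total;
    Result := Seen.Count;
  finally
    Lock.Free;
    Seen.Free;
  end;
end;

const
  // Hypothetical pause schedule in seconds before each pass.
  Pauses: array[0..4] of Integer = (0, 0, 10, 60, 180);
var
  S: string;
  i, Threads, Matches: Integer;
  Ms: Int64;
begin
  Randomize;
  SetLength(S, 120000);
  for i := 1 to Length(S) do
    S[i] := Chr(Ord('a') + Random(26));
  for i := Low(Pauses) to High(Pauses) do
  begin
    Sleep(Pauses[i] * 1000);
    Threads := ProbePass(S, 5000, Matches, Ms);
    Writeln(Format('pause %ds: %d matches in %dms on %d threads',
      [Pauses[i], Matches, Ms, Threads]));
  end;
end.
```

If the slow passes report far fewer threads than the fast ones, that would point at the pool's thread management rather than at the loop body. The lock is taken once per iteration, which adds a little overhead, but that is small next to a 120000-character scan.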
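Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide on the thread pool, but I don't see the point of using parallel programming to serve requests faster if I have to wait minutes between one request and the next.
Am I missing something about how a TParallel.For loop should be used?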
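Since `TParallel.For` also has an overload that takes an explicit `TThreadPool`, one experiment is to run the loop on a private pool instead of `TThreadPool.Default`. The sketch below assumes that overload plus `TThreadPool.SetMinWorkerThreads` as found in XE7's System.Threading. Pinning the minimum worker count to `TThread.ProcessorCount` is a hypothetical workaround to test, not a documented fix, and `CountLetter`/`RunPasses` are helpers introduced here. It runs five back-to-back passes so any slowdown after the first one shows up directly.

```pascal
program DedicatedPoolTest;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Threading, System.Diagnostics;

// Same test as the question: occurrences in S of the letter chosen by Value.
function CountLetter(const S: string; Value: Integer): Integer;
var
  k: Integer;
begin
  Result := 0;
  for k := 1 to Length(S) do
    if (Value mod 26) + Ord('a') = Ord(S[k]) then
      Inc(Result);
end;

// Runs five back-to-back passes on a private pool (hypothetical workaround).
procedure RunPasses(const S: string);
var
  Pool: TThreadPool;
  Ref: string;
  Matches, Pass: Integer;
  Sw: TStopwatch;
begin
  Ref := S; // local copy that the anonymous method can capture
  Pool := TThreadPool.Create;
  try
    // Ask the pool for at least ProcessorCount workers (assumed semantics
    // of MinWorkerThreads); the call returns False if the value is rejected.
    if not Pool.SetMinWorkerThreads(TThread.ProcessorCount) then
      Writeln('Pool rejected the minimum worker count');
    for Pass := 1 to 5 do
    begin
      Matches := 0;
      Sw := TStopwatch.StartNew;
      TParallel.For(1, 5000,
        procedure(Value: Integer)
        begin
          AtomicIncrement(Matches, CountLetter(Ref, Value));
        end,
        Pool);
      Writeln('Pass ', Pass, ': ', Matches, ' matches in ', Sw.ElapsedMilliseconds, 'ms');
    end;
  finally
    Pool.Free;
  end;
end;

var
  S: string;
  i: Integer;
begin
  Randomize;
  SetLength(S, 120000);
  for i := 1 to Length(S) do
    S[i] := Chr(Ord('a') + Random(26));
  RunPasses(S);
end.
```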
Note that I can't reproduce this on an AWS m3.large instance (2 vCPUs according to AWS). There I always get a slight improvement, and subsequent calls to TParallel.For don't give worse results.
Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms
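So it seems this effect appears when many cores are available (36), which is a pity, because the whole point of parallel programming is to benefit from many cores. I wonder whether this is a library bug triggered by the high core count, or by the core count not being a power of two in this case.
Update: after testing on various AWS instances with different vCPU counts, this seems to be the behaviour:
- 36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (which makes it unusable in production)
- 32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (which makes it unusable in production)
- 16 vCPUs (c3.4xlarge). You have to wait seconds. It could be usable if the load is low but response time still matters
- 8 vCPUs (c3.2xlarge). It seems to work normally
- 4 vCPUs (c3.xlarge). It seems to work normally
- 2 vCPUs (m3.large). It seems to work normally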
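To test the core-count hypothesis on a single large instance, the same workload can be run on private pools capped at the vCPU counts from the list above, plus the machine's own processor count. This is a sketch under assumptions: it relies on `TThreadPool.SetMinWorkerThreads`/`SetMaxWorkerThreads` and the pool-taking `TParallel.For` overload, it prints a warning if the pool rejects a limit instead of assuming the call succeeded, and `CountLetter`/`RunWithCap` are hypothetical helpers.

```pascal
program WorkerCapSweep;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Threading, System.Diagnostics;

// Same test as the question: occurrences in S of the letter chosen by Value.
function CountLetter(const S: string; Value: Integer): Integer;
var
  k: Integer;
begin
  Result := 0;
  for k := 1 to Length(S) do
    if (Value mod 26) + Ord('a') = Ord(S[k]) then
      Inc(Result);
end;

// Runs Passes back-to-back TParallel.For passes on a private pool limited
// to Cap worker threads, printing the time of each pass.
procedure RunWithCap(const S: string; Cap, Passes: Integer);
var
  Pool: TThreadPool;
  Ref: string;
  Matches, Pass: Integer;
  Sw: TStopwatch;
begin
  Ref := S; // local copy that the anonymous method can capture
  Pool := TThreadPool.Create;
  try
    // Set the minimum first so the maximum is never below it; each call
    // returns False if the pool rejects the value.
    if not (Pool.SetMinWorkerThreads(Cap) and Pool.SetMaxWorkerThreads(Cap)) then
      Writeln('cap ', Cap, ': pool rejected a limit');
    for Pass := 1 to Passes do
    begin
      Matches := 0;
      Sw := TStopwatch.StartNew;
      TParallel.For(1, 5000,
        procedure(Value: Integer)
        begin
          AtomicIncrement(Matches, CountLetter(Ref, Value));
        end,
        Pool);
      Writeln('cap ', Cap, ', pass ', Pass, ': ', Matches, ' matches in ',
        Sw.ElapsedMilliseconds, 'ms');
    end;
  finally
    Pool.Free;
  end;
end;

const
  // vCPU counts from the update; the machine's own count is run last.
  Caps: array[0..3] of Integer = (2, 4, 8, 16);
var
  S: string;
  i: Integer;
begin
  Randomize;
  SetLength(S, 120000);
  for i := 1 to Length(S) do
    S[i] := Chr(Ord('a') + Random(26));
  for i := Low(Caps) to High(Caps) do
    RunWithCap(S, Caps[i], 3);
  RunWithCap(S, TThread.ProcessorCount, 3);
end.
```

Each cap gets a fresh pool, so every series starts from the same cold state as the question's fast first pass. If the capped series stay fast across back-to-back passes while the uncapped one degrades, that would suggest the problem grows with the number of workers rather than with the hardware itself.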
Dav*_*nan · score 15
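I created two test programs, based on yours, to compare System.Threading and OTL. I built with XE7 Update 1 and OTL r1397. The OTL source I used corresponds to release 3.04. I built with the 32-bit Windows compiler, using release build options.
My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad-core processors. That's 8 processors in total, but the system reports 16 because of hyperthreading. Experience tells me that hyperthreading is mostly marketing, and I have never seen scaling beyond a factor of 8 on this machine.
Now the two programs, which are almost identical.
System.Threading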
program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.
OTL
program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.
And now the output.
System.Threading output
Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....
OTL output
Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....
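I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial version.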
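Applying the same speedup ratio to the outputs above makes the comparison concrete. The pairs are taken from the listed runs (System.Threading passes 1 and 4, OTL pass 2). The 8 physical cores mentioned earlier are only a rough ceiling, since a single slow serial run (such as the 2710ms one) can push an individual ratio slightly above 8.

```latex
\begin{aligned}
S &= T_\text{serial} / T_\text{parallel} \\
S_\text{System.Threading, pass 1} &= 2423 / 374 \approx 6.48 \\
S_\text{System.Threading, pass 4} &= 2430 / 2412 \approx 1.01 \\
S_\text{OTL, pass 2} &= 2438 / 335 \approx 7.28
\end{aligned}
```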
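Conclusion
The code is exceedingly simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading is defective.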
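There have been numerous bug reports relating to the new System.Threading library. All the signs are that its quality is poor. Embarcadero have a long track record of releasing sub-standard library code. I'm thinking of TMonitor, the XE3 string helpers, early versions of System.IOUtils, and FireMonkey. The list goes on.
It seems clear that quality is a big problem with Embarcadero. Code is released that quite clearly has not been tested adequately, if at all. This is especially troublesome for a threading library, where bugs can lie dormant and only be exposed in specific hardware/software configurations. My experience with TMonitor leads me to believe that Embarcadero do not have sufficient expertise to produce high-quality, correct threading code.
My advice is that you should not use System.Threading in its current form. Until it can be seen to have sufficient quality and correctness, it should be shunned. I suggest that you use OTL.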
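Edit: the original OTL version of the program had a live memory leak, caused by an ugly implementation detail. Parallel.For creates its tasks with the .Unobserved modifier. That causes those tasks to be destroyed only when a certain internal message window receives a "task has terminated" message. This window is created in the same thread as the Parallel.For caller, i.e. the main thread in this case. Since the main thread was not processing messages, the tasks were never destroyed and memory consumption (along with other resources) just piled up. It's possible that this is why the program hung after a while.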