英特尔x86汇编优化技术中的示例问题

Wod*_*dzu 21 delphi optimization x86 assembly basm

我正在学习汇编程序很长一段时间,我正在尝试重写一些简单的过程\函数来查看性能优势(如果有的话).我的主要开发工具是Delphi 2007,第一个例子将使用该语言,但它们也可以很容易地翻译成其他语言.

问题表明:

我们给出了一个无符号字节值,其中八位中的每一位代表一行屏幕中的一个像素.每个单个像素可以是实心(1)或透明(0).换句话说,我们在一个字节值中包含8个像素.我想将这些像素解压缩成一个8字节的数组,就像最年轻的像素(位)将落在数组的最低索引之下一样,依此类推.这是一个例子:

One byte value -----------> eight byte array

10011011 -----------------> [1][1][0][1][1][0][0][1]

Array index number ------->  0  1  2  3  4  5  6  7
Run Code Online (Sandbox Code Playgroud)

下面我介绍解决问题的五种方法.接下来,我将展示他们的时间比较以及我如何衡量这些时间.

我的问题包括两部分:

1.

我问您详细的有关方法的答案DecodePixels4aDecodePixels4b.为什么方法4b有点慢4a

例如,如果我的代码没有正确对齐,它会慢一些,那么告诉我给定方法中哪些指令可以更好地对齐,以及如何做到这一点不破坏方法.

我想看看这个理论背后的真实例子.请记住,我正在学习汇编,我想从你的答案中获得知识,这使我将来能够编写更好的优化代码.

2.

你能写更快的常规DecodePixels4a吗?如果是,请提供并描述您已采取的优化步骤.通过更快的例程,我的意思是在测试环境中在最短的时间段内运行的例程,在此处提供的所有例程中.

允许使用所有Intel系列处理器以及与之兼容的处理器.

您将在下面找到我编写的例程:

procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
  i3: Integer;
begin
  DecPixels[0] := EncPixels and $01;
  for i3 := 1 to 7 do
  begin
    EncPixels := EncPixels shr 1;
    DecPixels[i3] := EncPixels and $01;
    //DecPixels[i3] := (EncPixels shr i3) and $01;  //this is even slower if you replace above 2 lines with it
  end;
end;


//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[1] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[2] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[3] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[4] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[5] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[6] := EncPixels and $01;
  EncPixels := EncPixels shr 1;
  DecPixels[7] := EncPixels and $01;
end;


procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    push ecx;
    mov bl, al;
    and bl, $01;
    mov [edx], bl;
    mov ecx, $00;
@@Decode:
    inc ecx;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + ecx], bl;
    cmp ecx, $07;
    jnz @@Decode;
    pop ecx;
    pop ebx;
    pop eax;
  end;
end;


//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    mov  [edx], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $01], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $02], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $03], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $04], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $05], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $06], bl;
    shr al, $01;
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;


// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push eax;
    push ebx;
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx], bl;        //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $01], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $02], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $03], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $04], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $05], bl;  //
    mov bl, al;
    and bl, $01;
    shr al, $01;          //
    mov [edx + $06], bl;  //
    mov bl, al;
    and bl, $01;
    mov [edx + $07], bl;
    pop ebx;
    pop eax;
  end;
end;
Run Code Online (Sandbox Code Playgroud)

以下是我如何测试它们:

program Test;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

type
  TDecodedPixels = array[0..7] of Byte;
var
  Pixels: TDecodedPixels;
  Freq, TimeStart, TimeEnd :Int64;
  Time1, Time2, Time3, Time4a, Time4b: Extended;
  i, i2: Integer;

begin
  if QueryPerformanceFrequency(Freq) then
  begin
    for i2 := 1 to 100 do
    begin
      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels1(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels2(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels3(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4a(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);

      QueryPerformanceCounter(TimeStart);
      for i := 1 to 100000 do
        DecodePixels4b(155, Pixels);
      QueryPerformanceCounter(TimeEnd);
      Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);

    end;
    Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms.    <- Delphi loop.');
    Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms.    <- Delphi unrolled loop.');
    Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms.    <- BASM loop.');
    Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms.    <- BASM unrolled loop.');
    Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms.    <- BASM unrolled loop instruction switch.');
  end;
  Readln;
end.
Run Code Online (Sandbox Code Playgroud)

以下是我的机器(Win32 XP上的英特尔®奔腾®E2180)的结果:

Time1  : 1,68443549919493 ms.     <- Delphi loop.
Time2  : 1,33773024572211 ms.     <- Delphi unrolled loop.
Time3  : 1,37015271374424 ms.     <- BASM loop.
Time4a : 0,822916962526627 ms.    <- BASM unrolled loop.
Time4b : 0,862914462301607 ms.    <- BASM unrolled loop instruction switch.
Run Code Online (Sandbox Code Playgroud)

结果相当稳定 - 每次测试之间的时间差别只有几个百分点.这总是如此:Time1 > Time3 > Time 2 > Time4b > Time4a

所以我认为Time4a和Time4b之间的区别取决于该方法中的指令切换DecodePixels4b.有时它是4%有时它高达10%但4b总是慢于4a.

我正在考虑使用MMX指令同时将内存写入8个字节的另一种方法,但我无法想出将字节解压缩到64位寄存器的快速方法.

感谢您的时间.


谢谢你们的宝贵意见.Whish我可以同时回答所有人,不幸的是与现代CPU相比,我只有一个"管道"并且当时只能执行一条指令"回复";-)所以,我会尝试总结一些事情在这里,并在您的答案下写下其他评论.

首先,我想说在发布我的问题之前,我想出了Wouter van Nifterick提出的解决方案,它实际上我的汇编代码.所以我决定不在这里发布那个例程,但你可能会看到我在循环Delphi版本的例程中也采用了相同的方法.它在那里被评论,因为它给了我更糟糕的结果.

这对我来说是一个谜.我用Wouter和PhilS的例程再次运行我的代码,结果如下:

Time1  : 1,66535493194387 ms.     <- Delphi loop.
Time2  : 1,29115785420688 ms.     <- Delphi unrolled loop.
Time3  : 1,33716934524107 ms.     <- BASM loop.
Time4a : 0,795041753757838 ms.    <- BASM unrolled loop.
Time4b : 0,843520166815013 ms.    <- BASM unrolled loop instruction switch.
Time5  : 1,49457681191307 ms.     <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,400587402866258 ms.    <- PhiS, table lookup Delphi
Time7  : 0,325472442519827 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,37350491544239 ms.     <- PhiS, table lookup BASM
Run Code Online (Sandbox Code Playgroud)

看看Time5的结果,很奇怪不是吗?我想我有不同的Delphi版本,因为我生成的汇编代码与Wouter提供的不同.

第二个主要编辑:


我知道为什么常规5在我的machnie上慢了.我在编译器选项中检查了"范围检查"和"溢出检查".我已经添加assembler了例程指令,9看它是否有帮助.似乎用这个指令组装程序和Delphi内联变体一样好,甚至稍微好一些.

以下是最终结果:

Time1  : 1,22508325749317 ms.     <- Delphi loop.
Time2  : 1,33004145373084 ms.     <- Delphi unrolled loop.
Time3  : 1,1473583622526 ms.      <- BASM loop.
Time4a : 0,77322594033463 ms.     <- BASM unrolled loop.
Time4b : 0,846033593023372 ms.    <- BASM unrolled loop instruction switch.
Time5  : 0,688689382044384 ms.    <- Wouter van Nifterick, Delphi unrolled
Time6  : 0,503233741036693 ms.    <- PhiS, table lookup Delphi
Time7  : 0,385254722925063 ms.    <- PhiS, table lookup Delphi inline
Time8  : 0,432993919452751 ms.    <- PhiS, table lookup BASM
Time9  : 0,362680491244212 ms.    <- PhiS, table lookup BASM with assembler directive
Run Code Online (Sandbox Code Playgroud)

第三大编辑:


在看来@Pascal Cuoq和@j_random_hacker例程之间在执行时间的差4a,4b并且5是由数据相关性引起的.但是,基于我所做的进一步测试,我不得不反对这种观点.

我也发明了4c基于的新例程4a.这里是:

procedure DecodePixels4c(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  asm
    push ebx;
    mov bl, al;
    and bl, 1;
    mov [edx], bl;
    mov bl, al;
    shr bl, 1;
    and bl, 1;
    mov [edx + $01], bl;
    mov bl, al;
    shr bl, 2;
    and bl, 1;
    mov [edx + $02], bl;
    mov bl, al;
    shr bl, 3;
    and bl, 1;
    mov [edx + $03], bl;
    mov bl, al;
    shr bl, 4;
    and bl, 1;
    mov [edx + $04], bl;
    mov bl, al;
    shr bl, 5;
    and bl, 1;
    mov [edx + $05], bl;
    mov bl, al;
    shr bl, 6;
    and bl, 1;
    mov [edx + $06], bl;
    shr al, 7;
    and al, 1;
    mov [edx + $07], al;
    pop ebx;
  end;
end;
Run Code Online (Sandbox Code Playgroud)

我会说它依赖于数据.

以下是测试和结果.我做了四次测试,以确保没有意外.我还为GJ提出的例程添加了新的时间(Time10a,Time10b).

          Test1  Test2  Test3  Test4

Time1   : 1,211  1,210  1,220  1,213
Time2   : 1,280  1,258  1,253  1,332
Time3   : 1,129  1,138  1,130  1,160

Time4a  : 0,690  0,682  0,617  0,635
Time4b  : 0,707  0,698  0,706  0,659
Time4c  : 0,679  0,685  0,626  0,625
Time5   : 0,715  0,682  0,686  0,679

Time6   : 0,490  0,485  0,522  0,514
Time7   : 0,323  0,333  0,336  0,318
Time8   : 0,407  0,403  0,373  0,354
Time9   : 0,352  0,378  0,355  0,355
Time10a : 1,823  1,812  1,807  1,813
Time10b : 1,113  1,120  1,115  1,118
Time10c : 0,652  0,630  0,653  0,633
Time10d : 0,156  0,155  0,172  0,160  <-- current winner!
Run Code Online (Sandbox Code Playgroud)

正如你可能会看到的结果4a,4b,4c5非常接近对方.这是为什么?因为我已经从4a,4b(4c已经没有它)中删除了两条指令:push eaxpop eax.因为我知道我不会在我的代码中的任何其他地方使用eax下的值我不必保留它.现在我的代码只有一对push/pop,因此例程5.例程5保留了eax的值,因为它首先在ecx下复制它,但它不能保留ecx.

所以我的结论是:5和4a和4b(第三次编辑之前)的执行时间差异与数据依赖性无关,而是由额外的一对推/弹指令引起的.

我对你的评论很感兴趣.

几天之后,GJ发明了比PhiS更快的例程(时间10d).干得好GJ!

Wou*_*ick 16

一般来说,我个人不会试图通过在汇编程序级别使用技巧来优化代码,除非你真的需要额外的2%或3%的速度,并且你愿意支付更难阅读的代码价格,维护和移植.

为了挤压最后1%,您甚至可能必须维护每个处理器优化的多个版本,如果出现更新的处理器和改进的pascal编译器,您将无法从中受益.

这个Delphi代码比你最快的汇编程序代码更快:

procedure DecodePixels5(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels[0] := (EncPixels shr 0) and $01;
  DecPixels[1] := (EncPixels shr 1) and $01;
  DecPixels[2] := (EncPixels shr 2) and $01;
  DecPixels[3] := (EncPixels shr 3) and $01;
  DecPixels[4] := (EncPixels shr 4) and $01;
  DecPixels[5] := (EncPixels shr 5) and $01;
  DecPixels[6] := (EncPixels shr 6) and $01;
  DecPixels[7] := (EncPixels shr 7) and $01;
end;


Results:

Time1  : 1,03096806151283 ms.    <- Delphi loop.
Time2  : 0,740308641141395 ms.   <- Delphi unrolled loop.
Time3  : 0,996602425688886 ms.   <- BASM loop.
Time4a : 0,608267951561275 ms.   <- BASM unrolled loop.
Time4b : 0,574162510648039 ms.   <- BASM unrolled loop instruction switch.
Time5  : 0,499628206138524 ms. !!!  <- Delphi unrolled loop 5.
Run Code Online (Sandbox Code Playgroud)

它很快,因为操作只能通过寄存器完成,而不需要存储和获取内存.现代处理器部分并行地执行此操作(可以在前一个完成之前开始新操作),因为连续指令的结果彼此独立.

机器代码如下所示:

  push ebx;
  // DecPixels[0] := (EncPixels shr 0) and 1;
  movzx ecx,al
  mov ebx,ecx
  //  shr ebx,$00
  and bl,$01
  mov [edx],bl
  // DecPixels[1] := (EncPixels shr 1) and 1;
  mov ebx,ecx
  shr ebx,1
  and bl,$01
  mov [edx+$01],bl
  // DecPixels[2] := (EncPixels shr 2) and 1;
  mov ebx,ecx
  shr ebx,$02
  and bl,$01
  mov [edx+$02],bl
  // DecPixels[3] := (EncPixels shr 3) and 1;
  mov ebx,ecx
  shr ebx,$03
  and bl,$01
  mov [edx+$03],bl
  // DecPixels[4] := (EncPixels shr 4) and 1;
  mov ebx,ecx
  shr ebx,$04
  and bl,$01
  mov [edx+$04],bl
  // DecPixels[5] := (EncPixels shr 5) and 1;
  mov ebx,ecx
  shr ebx,$05
  and bl,$01
  mov [edx+$05],bl
  // DecPixels[6] := (EncPixels shr 6) and 1;
  mov ebx,ecx
  shr ebx,$06
  and bl,$01
  mov [edx+$06],bl
  // DecPixels[7] := (EncPixels shr 7) and 1;
  shr ecx,$07
  and cl,$01
  mov [edx+$07],cl
  pop ebx;
Run Code Online (Sandbox Code Playgroud)

编辑:根据建议,表查找确实更快.

var
  PixelLookup:Array[byte] of TDecodedPixels;

// You could precalculate, but the performance gain would hardly be worth it because you call this once only.
for I := 0 to 255 do
  DecodePixels5b(I, PixelLookup[I]);


procedure DecodePixels7(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels := PixelLookup[EncPixels];
end;

Results:

Time1  : 1,03096806151283 ms.    <- Delphi loop.
Time2  : 0,740308641141395 ms.   <- Delphi unrolled loop.
Time3  : 0,996602425688886 ms.   <- BASM loop.
Time4a : 0,608267951561275 ms.   <- BASM unrolled loop.
Time4b : 0,574162510648039 ms.   <- BASM unrolled loop instruction switch.
Time5  : 0,499628206138524 ms. !!!  <- Delphi unrolled loop 5.
Time7 : 0,251533475182096 ms.    <- simple table lookup
Run Code Online (Sandbox Code Playgroud)

  • 提高速度的另一个可能原因:您现在拥有8*个独立*执行流程,可以在现代超标量处理器(特别是P4及以上)上并行执行(部分).之前,在前一位的计算完成之前,每个位的计算都无法开始. (4认同)

GJ.*_*GJ. 6

你的asm代码相对较慢,因为使用堆栈结束写入内存8次.检查一下......

procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels + 4], ecx
  xor   ecx, ecx
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 8
  add   al, al
  rcl   ecx, 1
  mov   [DecPixels], ecx
end;
Run Code Online (Sandbox Code Playgroud)

也许比使用查找表的代码更快!

改良版:

procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels + 4], ecx
  mov   ecx, 0    //Faster than: xor   ecx, ecx
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  ror   ecx, 16
  add   al, al
  rcl   ch, 1
  add   al, al
  rcl   cl, 1
  mov   [DecPixels], ecx
end;
Run Code Online (Sandbox Code Playgroud)

版本3:

procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
  add   al, al
  setc  byte ptr[DecPixels + 7]
  add   al, al
  setc  byte ptr[DecPixels + 6]
  add   al, al
  setc  byte ptr[DecPixels + 5]
  add   al, al
  setc  byte ptr[DecPixels + 4]
  add   al, al
  setc  byte ptr[DecPixels + 3]
  add   al, al
  setc  byte ptr[DecPixels + 2]
  add   al, al
  setc  byte ptr[DecPixels + 1]
  setnz byte ptr[DecPixels]
end;
Run Code Online (Sandbox Code Playgroud)

第4版:

const Uint32DecPix : array [0..15] of cardinal = (
  $00000000, $00000001, $00000100, $00000101,
  $00010000, $00010001, $00010100, $00010101,
  $01000000, $01000001, $01000100, $01000101,
  $01010000, $01010001, $01010100, $01010101
  );

procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
  pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
  pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;
Run Code Online (Sandbox Code Playgroud)


Phi*_*hiS 5

扩展了Nick D的答案,我尝试了以下基于表查找的版本,所有这些版本都比你给出的实现更快(并且比Wouter van Nifterick的代码更快).

给定以下打包数组:


      const Uint64DecPix : PACKED ARRAY [0..255] OF UINT64 =
  ( $0000000000000000, $0000000000000001, $0000000000000100, $0000000000000101, $0000000000010000, $0000000000010001, $0000000000010100, $0000000000010101, $0000000001000000, $0000000001000001, $0000000001000100, $0000000001000101, $0000000001010000, $0000000001010001, $0000000001010100, $0000000001010101,
    $0000000100000000, $0000000100000001, $0000000100000100, $0000000100000101, $0000000100010000, $0000000100010001, $0000000100010100, $0000000100010101, $0000000101000000, $0000000101000001, $0000000101000100, $0000000101000101, $0000000101010000, $0000000101010001, $0000000101010100, $0000000101010101,
    $0000010000000000, $0000010000000001, $0000010000000100, $0000010000000101, $0000010000010000, $0000010000010001, $0000010000010100, $0000010000010101, $0000010001000000, $0000010001000001, $0000010001000100, $0000010001000101, $0000010001010000, $0000010001010001, $0000010001010100, $0000010001010101,
    $0000010100000000, $0000010100000001, $0000010100000100, $0000010100000101, $0000010100010000, $0000010100010001, $0000010100010100, $0000010100010101, $0000010101000000, $0000010101000001, $0000010101000100, $0000010101000101, $0000010101010000, $0000010101010001, $0000010101010100, $0000010101010101,
    $0001000000000000, $0001000000000001, $0001000000000100, $0001000000000101, $0001000000010000, $0001000000010001, $0001000000010100, $0001000000010101, $0001000001000000, $0001000001000001, $0001000001000100, $0001000001000101, $0001000001010000, $0001000001010001, $0001000001010100, $0001000001010101,
    $0001000100000000, $0001000100000001, $0001000100000100, $0001000100000101, $0001000100010000, $0001000100010001, $0001000100010100, $0001000100010101, $0001000101000000, $0001000101000001, $0001000101000100, $0001000101000101, $0001000101010000, $0001000101010001, $0001000101010100, $0001000101010101,
    $0001010000000000, $0001010000000001, $0001010000000100, $0001010000000101, $0001010000010000, $0001010000010001, $0001010000010100, $0001010000010101, $0001010001000000, $0001010001000001, $0001010001000100, $0001010001000101, $0001010001010000, $0001010001010001, $0001010001010100, $0001010001010101,
    $0001010100000000, $0001010100000001, $0001010100000100, $0001010100000101, $0001010100010000, $0001010100010001, $0001010100010100, $0001010100010101, $0001010101000000, $0001010101000001, $0001010101000100, $0001010101000101, $0001010101010000, $0001010101010001, $0001010101010100, $0001010101010101,
    $0100000000000000, $0100000000000001, $0100000000000100, $0100000000000101, $0100000000010000, $0100000000010001, $0100000000010100, $0100000000010101, $0100000001000000, $0100000001000001, $0100000001000100, $0100000001000101, $0100000001010000, $0100000001010001, $0100000001010100, $0100000001010101,
    $0100000100000000, $0100000100000001, $0100000100000100, $0100000100000101, $0100000100010000, $0100000100010001, $0100000100010100, $0100000100010101, $0100000101000000, $0100000101000001, $0100000101000100, $0100000101000101, $0100000101010000, $0100000101010001, $0100000101010100, $0100000101010101,
    $0100010000000000, $0100010000000001, $0100010000000100, $0100010000000101, $0100010000010000, $0100010000010001, $0100010000010100, $0100010000010101, $0100010001000000, $0100010001000001, $0100010001000100, $0100010001000101, $0100010001010000, $0100010001010001, $0100010001010100, $0100010001010101,
    $0100010100000000, $0100010100000001, $0100010100000100, $0100010100000101, $0100010100010000, $0100010100010001, $0100010100010100, $0100010100010101, $0100010101000000, $0100010101000001, $0100010101000100, $0100010101000101, $0100010101010000, $0100010101010001, $0100010101010100, $0100010101010101,
    $0101000000000000, $0101000000000001, $0101000000000100, $0101000000000101, $0101000000010000, $0101000000010001, $0101000000010100, $0101000000010101, $0101000001000000, $0101000001000001, $0101000001000100, $0101000001000101, $0101000001010000, $0101000001010001, $0101000001010100, $0101000001010101,
    $0101000100000000, $0101000100000001, $0101000100000100, $0101000100000101, $0101000100010000, $0101000100010001, $0101000100010100, $0101000100010101, $0101000101000000, $0101000101000001, $0101000101000100, $0101000101000101, $0101000101010000, $0101000101010001, $0101000101010100, $0101000101010101,
    $0101010000000000, $0101010000000001, $0101010000000100, $0101010000000101, $0101010000010000, $0101010000010001, $0101010000010100, $0101010000010101, $0101010001000000, $0101010001000001, $0101010001000100, $0101010001000101, $0101010001010000, $0101010001010001, $0101010001010100, $0101010001010101,
    $0101010100000000, $0101010100000001, $0101010100000100, $0101010100000101, $0101010100010000, $0101010100010001, $0101010100010100, $0101010100010101, $0101010101000000, $0101010101000001, $0101010101000100, $0101010101000101, $0101010101010000, $0101010101010001, $0101010101010100, $0101010101010101);
PUint64DecPix : pointer = @Uint64DecPix;

你可以写下面的内容:


procedure DecodePixelsPS1Pas (EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
  DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]);
end;

procedure DecodePixelsPS1PasInline (EncPixels: Byte; var DecPixels: TDecodedPixels); inline; begin DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]); end;

procedure DecodePixelsPS1Asm (EncPixels: Byte; var DecPixels: TDecodedPixels); asm lea ecx, Uint64DecPix //[<-Added in EDIT 3] //mov ecx, dword ptr PUint64DecPix - alternative to the above line (slower for me) movzx eax, al movq xmm0, [8*eax+ecx] //Using XMM rather than MMX so we don't have to issue emms at the end movq [edx], xmm0 //use MOVQ because it doesn't need mem alignment end;

标准的PAS和ASM实现在速度方面非常相似,但标有"INLINE"的PAS实现是最快的,因为它消除了调用例程所涉及的所有调用/返回.

--EDIT--:我忘了说:因为你隐含地假设你的TDecodedPixels结构的内存布局,如果你声明它是更好的


PACKED ARRAY [0..7] of byte

--EDIT2--:这是我的比较结果:


Time1 : 2.51638266874701 ms.    <- Delphi loop.
Time2 : 2.11277620479698 ms.    <- Delphi unrolled loop.
Time3 : 2.21972066282167 ms.    <- BASM loop.
Time4a : 1.34093090043567 ms.    <- BASM unrolled loop.
Time4b : 1.52222070123437 ms.    <- BASM unrolled loop instruction switch.
Time5 : 1.17106364076999 ms.    <- Wouter van Nifterick
TimePS1 : 0.633099318488802 ms.    <- PS.Pas
TimePS2 : 0.551617593856202 ms.    <- PS.Pas Inline
TimePS3 : 0.70921094720139 ms.    <- PS.Asm (speed for version before 3rd EDIT)


归档时间:

查看次数:

3835 次

最近记录:

6 年 前