C vs. assembler vs. NEON performance

Ham*_*mer 18 c iphone assembly image-processing neon

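I am developing an iPhone application that does real-time image processing. One of the earliest steps in its pipeline is converting a BGRA image to grayscale. I tried several different methods, and the differences in timing are far greater than I imagined possible. First I tried C. I approximate the conversion to luminance as (B + 2*G + R) / 4: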

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();

uchar *pOut = imByte.data;
for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{
    unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
    pOut[0] = sumA / 4;
    unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
    pOut[1] = sumB / 4;
    unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
    pOut[2] = sumC / 4;
    unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
    pOut[3] = sumD / 4;
    pOut +=4;
}       
}

This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing:

void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();

unsigned int *pOut = (unsigned int*) imByte.data;

for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{
  register unsigned int nBGRA1 asm("r4");
  register unsigned int nBGRA2 asm("r5");
  unsigned int nZero=0;
  unsigned int nSum1;
  unsigned int nSum2;
  unsigned int nPacked1;
  asm volatile(

               "ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0]       \n"   // Load in two BGRA words
               "usad8 %[nSum1], %[nBGRA1], %[nZero]  \n"  // Add R+G+B+A 
               "usad8 %[nSum2], %[nBGRA2], %[nZero]  \n"  // Add R+G+B+A 
               "uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8    \n"   // Add G again
               "uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8    \n"   // Add G again
               "mov %[nPacked1], %[nSum1], LSR #2 \n"    // Init packed word   
               "mov %[nSum2], %[nSum2], LSR #2 \n"   // Div by four
               "add %[nPacked1], %[nPacked1], %[nSum2], LSL #8 \n"   // Add to packed word                 

               "ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8]       \n"   // Load in two more BGRA words
               "usad8 %[nSum1], %[nBGRA1], %[nZero]  \n"  // Add R+G+B+A 
               "usad8 %[nSum2], %[nBGRA2], %[nZero]  \n"  // Add R+G+B+A 
               "uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8    \n"   // Add G again
               "uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8    \n"   // Add G again
               "mov %[nSum1], %[nSum1], LSR #2 \n"   // Div by four
               "add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n"   // Add to packed word
               "mov %[nSum2], %[nSum2], LSR #2 \n"   // Div by four
               "add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n"   // Add to packed word                 

               ///////////
               ////////////

               : [pIn]"+r" (pIn), 
         [nBGRA1]"+r"(nBGRA1),
         [nBGRA2]"+r"(nBGRA2),
         [nZero]"+r"(nZero),
         [nSum1]"+r"(nSum1),
         [nSum2]"+r"(nSum2),
         [nPacked1]"+r"(nPacked1)
               :
               : "cc"  );
  *pOut = nPacked1;
  pOut++;
 }
 }
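For readers who do not know the two ARM instructions doing the work in the loop above, here is a portable C++ model of what one pixel word goes through. It is only a sketch, not the code that was timed: the helper names `model_usad8_zero`, `model_uxtab_ror8` and `model_gray` are made up for illustration, and a little-endian BGRA layout (B in the lowest byte) is assumed.

```cpp
#include <cstdint>

// Hypothetical model of "usad8 rd, rn, rzero": with a zero second operand,
// the sum of absolute differences is simply the sum of the four bytes of w.
inline std::uint32_t model_usad8_zero(std::uint32_t w) {
    return (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
}

// Hypothetical model of "uxtab rd, rn, rm, ROR #8": rotate rm right by 8,
// zero-extend its low byte (originally bits 15..8, the G channel), add to acc.
inline std::uint32_t model_uxtab_ror8(std::uint32_t acc, std::uint32_t w) {
    return acc + ((w >> 8) & 0xFFu);
}

// One pixel as the assembler loop computes it: (B + G + R + A + G) >> 2.
inline std::uint32_t model_gray(std::uint32_t bgra) {
    return model_uxtab_ror8(model_usad8_zero(bgra), bgra) >> 2;
}
```

With A = 0 this matches the C version's (B + 2*G + R) / 4. Because `usad8` sums all four bytes, a non-zero alpha is included in the sum, and the result can then exceed 255, in which case the packed `add` instructions would carry into the neighbouring output byte.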

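This function converts the same image in 12 ms, almost 5x faster! I have not programmed in assembler before, but I assumed it would not be much faster than C for such a simple operation. Encouraged by this success, I kept searching and found a NEON conversion example: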

void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
__asm__ volatile("lsr          %2, %2, #3      \n"
                 "# build the three constants: \n"
                 "mov         r4, #28          \n" // Blue channel multiplier
                 "mov         r5, #151         \n" // Green channel multiplier
                 "mov         r6, #77          \n" // Red channel multiplier
                 "vdup.8      d4, r4           \n"
                 "vdup.8      d5, r5           \n"
                 "vdup.8      d6, r6           \n"
                 "0:                           \n"
                 "# load 8 pixels:             \n"
                 "vld4.8      {d0-d3}, [%1]!   \n"
                 "# do the weight average:     \n"
                 "vmull.u8    q7, d0, d4       \n"
                 "vmlal.u8    q7, d1, d5       \n"
                 "vmlal.u8    q7, d2, d6       \n"
                 "# shift and store:           \n"
                 "vshrn.u16   d7, q7, #8       \n" // Divide q7 by 256 and store in d7
                 "vst1.8      {d7}, [%0]!      \n"
                 "subs        %2, %2, #1       \n" // Decrement iteration count
                 "bne         0b            \n" // Repeat until iteration count is zero
                 :
                 :  "r"(output_data),           
                 "r"(input_data),           
                 "r"(tot_pixels)        
                 : "r4", "r5", "r6"
                 );
}
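Note that the NEON routine uses different weights from the C and assembler versions: 28, 151 and 77, which add up to 256, followed by a shift right by 8. A scalar reference with the same arithmetic is handy for checking the vector output. The sketch below assumes BGRA byte order; the names `weighted_gray` and `gray_scale_reference` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference for the weights used by the NEON loop:
// gray = (28*B + 151*G + 77*R) >> 8, where 28 + 151 + 77 = 256.
inline std::uint8_t weighted_gray(std::uint8_t b, std::uint8_t g, std::uint8_t r) {
    const std::uint32_t sum = 28u * b + 151u * g + 77u * r;  // at most 255 * 256, fits in 16 bits
    return static_cast<std::uint8_t>(sum >> 8);
}

// Hypothetical helper: converts tot_pixels BGRA pixels one at a time.
// Unlike the NEON loop, it also handles counts that are not multiples of 8.
inline void gray_scale_reference(std::uint8_t* out, const std::uint8_t* in, std::size_t tot_pixels) {
    for (std::size_t i = 0; i < tot_pixels; ++i) {
        out[i] = weighted_gray(in[4 * i + 0], in[4 * i + 1], in[4 * i + 2]);
    }
}
```

Since the NEON loop first shifts `tot_pixels` right by 3, it works in groups of 8 pixels; a reference like this can also cover any leftover pixels.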

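The timing results were hard to believe. It converts the same image in 1 ms: 12x faster than the assembler and 55x faster than C. I had no idea such performance gains were possible. In light of this, I have a few questions. First, am I doing something terribly wrong in the C code? I still find it hard to believe it is that slow. Second, if these results are accurate, in what kinds of situations can I expect to see gains like these? You can imagine how excited I am at the prospect of making other parts of my pipeline run 55x faster. Should I learn assembler/NEON and use them in any loop that takes a significant amount of time?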

Update 1: I have posted the assembly output of my C function in a text file at http://temp-share.com/show/f3Yg87jQn. It was too large to include directly here.

Timing was done using OpenCV functions.

double duration = static_cast<double>(cv::getTickCount()); 
//function call 
duration = static_cast<double>(cv::getTickCount())-duration;
duration /= cv::getTickFrequency();
//duration is now the elapsed time in seconds (ticks / ticks-per-second); multiply by 1000 for ms
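A single measurement of a function that runs for only a few milliseconds can be noisy. As a minimal sketch, assuming `std::chrono` in place of the OpenCV tick functions, the call can be repeated and averaged. The helper name `mean_time_ms` is hypothetical.

```cpp
#include <chrono>

// Hypothetical helper: runs fn() `runs` times and returns the mean
// wall-clock time per call in milliseconds.
template <typename Fn>
double mean_time_ms(Fn&& fn, int runs) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        fn();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}
```

It would be called with a lambda wrapping the conversion, for example `mean_time_ms([&] { BGRA_To_Byte(imBGRA, imByte); }, 100)`.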

Results

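I tested several of the suggested improvements. First, as Viktor recommended, I reordered the inner loop to put all the fetches first. The inner loop then looked like this: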

for(; pIn < pLimit; pIn+=16)   // Does four pixels at a time
{     
  //Jul 16, 2012 MR: Read and writes collected
  sumA = pIn[0] + 2 * pIn[1] + pIn[2];
  sumB = pIn[4] + 2 * pIn[5] + pIn[6];
  sumC = pIn[8] + 2 * pIn[9] + pIn[10];
  sumD = pIn[12] + 2 * pIn[13] + pIn[14];
  pOut[0] = sumA / 4;
  pOut[1] = sumB / 4;
  pOut[2] = sumC / 4;
  pOut[3] = sumD / 4;
  pOut += 4;   // advance only after the four writes, so the first four output bytes are not skipped
}

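This change brought the processing time down to 53 ms, an improvement of 2 ms. Next, following Viktor's suggestion, I changed my function to fetch the input as unsigned ints. The inner loop then looked like this: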

unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* out = temp.data;

for(; in_int < end; in_int+=4)   // Does four pixels at a time
{
    unsigned int pixelA = in_int[0];
    unsigned int pixelB = in_int[1];
    unsigned int pixelC = in_int[2];
    unsigned int pixelD = in_int[3];

    uchar* byteA = (uchar*)&pixelA;
    uchar* byteB = (uchar*)&pixelB;
    uchar* byteC = (uchar*)&pixelC;
    uchar* byteD = (uchar*)&pixelD;         

    unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
    unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
    unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
    unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];

    out[0] = sumA / 4;
    out[1] = sumB / 4;
    out[2] = sumC / 4;
    out[3] = sumD / 4;
    out +=4;
    }

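This modification had a dramatic effect, cutting the processing time to 14 ms, a drop of 39 ms (about 74%). This result is very close to the assembler performance of 11 ms. The final optimization, suggested by rob, was to include the __restrict keyword. I added it in front of every pointer declaration, changing the following lines: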

__restrict unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
__restrict uchar* out = temp.data;  
...
__restrict uchar* byteA = (uchar*)&pixelA;
__restrict uchar* byteB = (uchar*)&pixelB;
__restrict uchar* byteC = (uchar*)&pixelC;
__restrict uchar* byteD = (uchar*)&pixelD;  
...     

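These changes had no measurable effect on the processing time. Thanks for all your help; I will pay much closer attention to memory access in the future.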
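For reference, in GCC and Clang `__restrict` qualifies the pointer itself, so it is normally written after the `*`. It tends to matter most on function parameters, where the compiler cannot otherwise tell that the input and output buffers do not overlap. The following is a sketch under those assumptions, with a hypothetical function name, not the code that was timed above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical signature: __restrict (a compiler extension in C++) qualifies
// the pointers themselves, promising that in and out never overlap.
inline void bgra_to_gray_restrict(const std::uint8_t* __restrict in,
                                  std::uint8_t* __restrict out,
                                  std::size_t n_pixels) {
    for (std::size_t i = 0; i < n_pixels; ++i) {
        // (B + 2*G + R) / 4, the approximation used in the question
        const unsigned sum = in[4 * i] + 2u * in[4 * i + 1] + in[4 * i + 2];
        out[i] = static_cast<std::uint8_t>(sum / 4u);
    }
}
```

Whether this changes the generated code depends on the compiler and optimization level, so it is worth inspecting the assembly output before and after.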

Vik*_*pov 5

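Here is a link that explains some of the reasons for NEON's "success": http://hilbert-space.de/?p=22

Try compiling your C code with the "-S -O3" switches to see the optimized output of the GCC compiler.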

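IMHO, the key to success is the optimized read/write pattern used by both assembly versions. NEON/MMX/other vector engines also support saturation (clamping results to 0..255 without having to use 'unsigned int').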
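To make the saturation remark concrete, here is a scalar sketch of the difference between a saturating and a truncating narrow from 16 to 8 bits. The function names are hypothetical; on NEON the saturating form corresponds to instructions such as `vqmovn`, while the plain C cast behaves like the truncating one.

```cpp
#include <cstdint>

// Hypothetical scalar model of a saturating narrow from 16 to 8 bits:
// values above 255 clamp to 255 instead of wrapping around.
inline std::uint8_t saturate_u8(std::uint16_t v) {
    return static_cast<std::uint8_t>(v > 255u ? 255u : v);
}

// Plain truncation for comparison: keeps only the low 8 bits.
inline std::uint8_t truncate_u8(std::uint16_t v) {
    return static_cast<std::uint8_t>(v & 0xFFu);
}
```

For an input of 300, the saturating version yields 255 and the truncating one yields 44.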

Look at the following lines in your loop:

unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;

The reads and writes are thoroughly interleaved. A slightly better version of the loop body would be:

// and the pIn reads can be combined into a single 4-byte fetch
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
pOut += 4;   // advance only after the four writes

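Keep in mind that the "unsigned int sumA" line here may really mean an alloca() call (an allocation on the stack), so you could be wasting a lot of cycles on temporary variable allocation (the function call 4 times).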

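Also, indexing with pIn[i] fetches only a single byte from memory at a time. A better way is to read a whole int and then extract the individual bytes from it. To speed things up, use an "unsigned int*" to read 4 bytes at once (pIn[i*4 + 0], pIn[i*4 + 1], pIn[i*4 + 2], pIn[i*4 + 3]).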
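The advice above can be sketched as follows: one 4-byte load per pixel, then shifts and masks to pull out the channels. This is an illustration rather than the exact code the asker ended up with; the names `gray_from_word` and `bgra_to_gray_words` are hypothetical, and a little-endian machine (B in the lowest byte of the word) is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: gray value of one little-endian BGRA word,
// using the question's (B + 2*G + R) / 4 approximation.
inline std::uint8_t gray_from_word(std::uint32_t bgra) {
    const std::uint32_t b = bgra & 0xFFu;
    const std::uint32_t g = (bgra >> 8) & 0xFFu;
    const std::uint32_t r = (bgra >> 16) & 0xFFu;
    return static_cast<std::uint8_t>((b + 2u * g + r) >> 2);
}

// Converts n_pixels BGRA pixels with one 4-byte fetch per pixel.
inline void bgra_to_gray_words(const std::uint8_t* in, std::uint8_t* out, std::size_t n_pixels) {
    for (std::size_t i = 0; i < n_pixels; ++i) {
        std::uint32_t word;
        std::memcpy(&word, in + 4 * i, sizeof word);  // alignment-safe way to read 4 bytes as one word
        out[i] = gray_from_word(word);
    }
}
```

Using `std::memcpy` for the load avoids alignment and aliasing problems that a raw pointer cast can cause; optimizing compilers usually turn it into a single load instruction.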

The NEON version is clearly superior: the lines

             "# load 8 pixels:             \n"
             "vld4.8      {d0-d3}, [%1]!   \n"

             "#save everything in one shot   \n"
             "vst1.8      {d7}, [%0]!      \n"

save most of the time spent on memory access.
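What makes `vld4.8` so effective is that a single instruction both loads 32 bytes and de-interleaves them, so each channel lands in its own register. A scalar model of that behaviour, with the hypothetical names `Planes8` and `deinterleave8`, might look like this:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of "vld4.8 {d0-d3}, [ptr]!": reads 32 interleaved
// bytes (8 BGRA pixels) and splits them into four 8-byte planes.
struct Planes8 {
    std::uint8_t b[8];  // corresponds to d0
    std::uint8_t g[8];  // corresponds to d1
    std::uint8_t r[8];  // corresponds to d2
    std::uint8_t a[8];  // corresponds to d3
};

inline Planes8 deinterleave8(const std::uint8_t* in) {
    Planes8 p{};
    for (std::size_t i = 0; i < 8; ++i) {
        p.b[i] = in[4 * i + 0];
        p.g[i] = in[4 * i + 1];
        p.r[i] = in[4 * i + 2];
        p.a[i] = in[4 * i + 3];
    }
    return p;
}
```

In the scalar model this takes 32 separate byte reads; the NEON unit does the same work in one load, after which the multiply-accumulate instructions operate on all 8 pixels of a channel at once.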

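  • C is a cross-platform assembler :). I don't think assembly is strictly necessary, but knowing how things work and where the bottlenecks may be definitely helps a lot. Here, memory access is almost always slower than arithmetic. (2 upvotes)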