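I have an image buffer that I need to convert to another format. The source buffer has four channels, 8 bits per channel: Alpha, Red, Green, and Blue. The destination buffer has three channels, 8 bits per channel: Blue, Green, and Red.
So the brute force method is: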
// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)
typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
ARGB orig[IMAGESIZE];
BGR  dest[IMAGESIZE];
for(x = 0; x < IMAGESIZE; x++)
{
     dest[x].Red = orig[x].Red;
     dest[x].Green = orig[x].Green;
     dest[x].Blue = orig[x].Blue;
}
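However, I need more speed than a loop and three byte copies provide. Since I'm running on a 32-bit machine, I'm hoping there are a few tricks I can use to reduce the number of memory reads and writes.
Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 RGB bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely on 32-bit boundaries.
I have access to OpenCL. That requires moving the entire buffer into GPU memory and then moving the result back out, but OpenCL can work on many portions of the image simultaneously, and large memory block moves are quite efficient, so this may be worth exploring.
While I've given the example of a small buffer above, I'm really moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around. So while the 32x32 case may be trivial, copying 8.3MB of image data byte by byte is really, really bad.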
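For scale, here is a minimal sketch of the arithmetic behind that 8.3MB figure. The helper name frame_bytes is made up for illustration and is not part of any API.

```c
#include <stddef.h>

/* Bytes needed for one frame at the given number of bytes per pixel.
   frame_bytes is a hypothetical helper, used only for this illustration. */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}
```

With these numbers, frame_bytes(1920, 1080, 4) is 8,294,400 bytes of ARGB input per frame, and frame_bytes(1920, 1080, 3) is 6,220,800 bytes of BGR output.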
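This runs on Intel processors (Core 2 and above), so I know there are streaming and data processing instructions, but I don't know them. Pointers on where to look for specialized data handling instructions would be good.
This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine traveling down that path, but not having done it on this setup before makes me wary of sinking too much time into it.
Pseudo-code is fine. I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.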
ugh*_*fhw 55
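I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data, and found the averages.
The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).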
void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
    }
}
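Note that swap1 stores 4 bytes per pixel at a 3-byte stride, so its final store writes one byte past the end of a destination that is exactly 3 bytes per pixel. A minimal portable sketch of the same byte-swap idea that finishes the last pixel with a 3-byte store follows. The names swap32 and argb_to_bgr_swap are invented for this illustration, and no claim is made about its speed relative to the versions timed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the four bytes of a 32-bit word (the effect of bswap). */
static uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* orig: pixels * 4 bytes in A,R,G,B order.
   dest: pixels * 3 bytes in B,G,R order. */
static void argb_to_bgr_swap(const uint8_t *orig, uint8_t *dest, size_t pixels) {
    size_t x;
    uint32_t v;
    if (pixels == 0) return;
    for (x = 0; x + 1 < pixels; x++) {
        memcpy(&v, orig + 4 * x, 4);   /* load that is safe for unaligned data */
        v = swap32(v);                 /* memory order is now B,G,R,A */
        memcpy(dest + 3 * x, &v, 4);   /* the A byte is overwritten by the next pixel */
    }
    /* Last pixel: store only 3 bytes so nothing lands past the end of dest. */
    memcpy(&v, orig + 4 * x, 4);
    v = swap32(v);
    memcpy(dest + 3 * x, &v, 3);
}
```

Because the swap reverses the bytes as they sit in memory, the result does not depend on the byte order of the machine.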
The second method performs the same operation, but uses an inline assembly loop instead of a C loop.
void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm (
        "0:\n\t"
        "movl   (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl   %%eax,(%0)\n\t"
        "addl   $4,%1\n\t"
        "addl   $3,%0\n\t"
        "decl   %2\n\t"
        "jnz    0b"
        : "+D" (dest), "+S" (orig), "+c" (imageSize) // the loop modifies all three registers
        :
        : "flags", "eax", "memory"
    );
}
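The third version is a modified version of just a poseur's answer. I converted the intrinsics to their GCC built-in equivalents and used the lddqu built-in so that the input argument doesn't need to be aligned.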
typedef uint8_t v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = __builtin_ia32_lddqu((const char[]){3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF});
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}
Finally, the fourth version is the inline assembly equivalent of the third.
void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};//{0xFF, 0xFF, 0xFF, 0xFF, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3};
    asm (
        "lddqu  (%3),%%xmm1\n\t"
        "0:\n\t"
        "lddqu  (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add    $16,%1\n\t"
        "add    $12,%0\n\t"
        "sub    $4,%2\n\t"
        "jnz    0b"
        : "+r" (dest), "+r" (orig), "+r" (imagesize) // the loop modifies these registers
        : "r" (mask)
        : "flags", "xmm0", "xmm1", "memory"
    );
}
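On my 2010 MacBook Pro (2.4 GHz i5, 4GB RAM), these were the average times for each:
Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds
As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel Macs, which didn't support SSSE3.
Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.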
              Average    | Standard Deviation
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)
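Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the four functions (plus the brute-force loop). The runtime of each function in microseconds is listed below.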
Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
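For anyone who wants to recompute the table from the raw numbers, here is a minimal sketch using the Version 3 samples. The function names mean_of and stddev_of are invented for this illustration. The standard deviation is the population form (dividing by n), which appears to be what the table reports, and the square root uses a few Newton steps only to avoid depending on the math library.

```c
#include <stddef.h>

/* Version 3 runtimes in microseconds, copied from the raw data. */
static const double version3_us[25] = {
    9036, 9619, 9341, 8970, 9453, 9758, 9043, 10114, 9243, 9027,
    9163, 9176, 9168, 9122, 9514, 9049, 9161, 9086, 9064, 9604,
    9178, 9233, 9301, 9717, 9156
};

/* Arithmetic mean of n samples. */
static double mean_of(const double *v, size_t n) {
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Population standard deviation (divides by n, not n - 1). */
static double stddev_of(const double *v, size_t n) {
    double m = mean_of(v, n), acc = 0.0, var, r;
    size_t i;
    for (i = 0; i < n; i++) acc += (v[i] - m) * (v[i] - m);
    var = acc / (double)n;
    if (var <= 0.0) return 0.0;
    r = var;                                  /* Newton's iteration for sqrt(var) */
    for (i = 0; i < 60; i++) r = 0.5 * (r + var / r);
    return r;
}
```

Calling mean_of(version3_us, 25) and stddev_of(version3_us, 25) gives values in microseconds; dividing by 1000 yields the milliseconds used in the table.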
jus*_*eur 25
The obvious, using pshufb.
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}
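To see what the mask does without SSSE3 hardware, here is a scalar sketch of the shuffle applied to a single 16-byte block. The names pshufb_model and argb_to_bgr_ctrl are invented for this illustration. The control bytes are the same values as in convert(), listed lowest byte first, because _mm_set_epi8 takes its arguments from the highest byte down.

```c
#include <stdint.h>

/* Scalar model of PSHUFB on one 16-byte block: a control byte with its top
   bit set produces 0, otherwise its low 4 bits select a source byte. */
static void pshufb_model(const uint8_t src[16], const uint8_t ctrl[16],
                         uint8_t out[16]) {
    int i;
    for (i = 0; i < 16; i++)
        out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
}

/* The mask from convert(), lowest byte first: four ARGB pixels become
   B,G,R x 4, followed by four zero bytes. */
static const uint8_t argb_to_bgr_ctrl[16] = {
    3, 2, 1,   7, 6, 5,   11, 10, 9,   15, 14, 13,   0x80, 0x80, 0x80, 0x80
};
```

Each 16-byte store therefore carries 12 useful bytes plus 4 zero bytes, which is why dest needs the 4 trailing scratch bytes.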
MSN*_*MSN 15
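Combining just a poseur's and Jitamaro's answers: if you assume that the inputs and outputs are 16-byte aligned and you process pixels in batches, you can use a combination of shuffles, masks, ands, and ors to write the result with aligned stores. The main idea is to generate four intermediate data sets, then or them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile this or try to run it at all.
EDIT2: More detail about the underlying code structure:
With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3-byte pixels only line up with a 16-byte boundary once every 16 pixels, we batch up 16 input pixels at a time and process them with a combination of shuffles, masks, and ors.
From LSB to MSB, the inputs look like this, ignoring the specific components: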
s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333
and the outputs look like this:
d[0]: 000 000 000 000 111 1
d[1]:  11 111 111 222 222 22
d[2]:   2 222 333 333 333 333
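So to generate those outputs, you need to do the following (I will specify the actual transformations later):
d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))
Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an or: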
combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))
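where (1 means select the left pixel, 0 means select the right pixel):
mask(0)= 111 111 111 111 000 0
mask(1)=  11 111 111 000 000 00
mask(2)=   1 111 000 000 000 000
But the actual transformations (f_<x>_low, f_<x>_high) are not that simple. Since we are reversing and removing bytes from the source pixel, the actual transformation is (for the first destination, for brevity):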
d[0]= 
    s[0][0].Blue s[0][0].Green s[0][0].Red 
    s[0][1].Blue s[0][1].Green s[0][1].Red 
    s[0][2].Blue s[0][2].Green s[0][2].Red 
    s[0][3].Blue s[0][3].Green s[0][3].Red
    s[1][0].Blue s[1][0].Green s[1][0].Red
    s[1][1].Blue
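If you translate the above into byte offsets from source to dest, you get:
d[0]=
    &s[0]+3  &s[0]+2  &s[0]+1
    &s[0]+7  &s[0]+6  &s[0]+5
    &s[0]+11 &s[0]+10 &s[0]+9
    &s[0]+15 &s[0]+14 &s[0]+13
    &s[1]+3  &s[1]+2  &s[1]+1
    &s[1]+7
(If you take a look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)
Now, we can generate a shuffle mask to map each source byte to a destination byte (X means we don't care what that value is):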
f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7
f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10
f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13
We can further optimize this by looking at the masks we use for each source block. If you take a look at the shuffle masks that we use for s[1]:
f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X
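Since the two shuffle masks don't overlap, we can combine them and simply mask off the irrelevant bytes in combine_, which we already did! The following code performs all these optimizations (plus it assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.
EDIT: changed the store to _mm_stream_si128 since you are likely doing a lot of writes and we don't necessarily want to flush the cache. Plus it should be aligned anyway, so you get free perf!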
#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>
// needs:
// orig and dest are 16-byte aligned
// imagesize is a multiple of 16
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert((uintptr_t)dest % 16 == 0); // _mm_stream_si128 requires an aligned address
    assert(imagesize % 16 == 0);
    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel
    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel
    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to third pixel
    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
        -128, -128, -128, -128); // unused
    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);
        // _mm_andnot_si128(m, x) computes (~m) & x, so the mask goes first
        _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
        _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
        _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
    }
}
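Since the intrinsics version was posted without being compiled, a scalar model is one way to sanity-check the shuffle tables and masks on any machine. The sketch below is only that: the names shuffle16, convert16_model, shuf, and left_bytes are invented here, the tables are the four shuffle constants from convert() written lowest byte first, and the masked select is expressed as "take the low N bytes from the left operand".

```c
#include <stdint.h>

#define Z 0x80  /* control byte that makes the shuffle emit 0 */

/* shuf0..shuf3 from convert(), listed lowest byte first. */
static const uint8_t shuf[4][16] = {
    { 3, 2, 1,  7, 6, 5,  11, 10, 9,  15, 14, 13,  Z, Z, Z, Z },
    { 6, 5,  11, 10, 9,  15, 14, 13,  Z, Z, Z, Z,  3, 2, 1,  7 },
    { 9,  15, 14, 13,  Z, Z, Z, Z,  3, 2, 1,  7, 6, 5,  11, 10 },
    { Z, Z, Z, Z,  3, 2, 1,  7, 6, 5,  11, 10, 9,  15, 14, 13 }
};

/* Low bytes taken from the left operand by mask0, mask1, and mask2. */
static const int left_bytes[3] = { 12, 8, 4 };

/* Scalar model of PSHUFB on one 16-byte block. */
static void shuffle16(const uint8_t *src, const uint8_t *ctrl, uint8_t *out) {
    int i;
    for (i = 0; i < 16; i++)
        out[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
}

/* Convert 16 ARGB pixels (64 bytes) into 16 BGR pixels (48 bytes). */
static void convert16_model(const uint8_t *src, uint8_t *dst) {
    uint8_t t[4][16];
    int k, i;
    for (k = 0; k < 4; k++)
        shuffle16(src + 16 * k, shuf[k], t[k]);
    for (k = 0; k < 3; k++)
        for (i = 0; i < 16; i++)
            /* same result as (left & mask) | (right & ~mask) */
            dst[16 * k + i] = (i < left_bytes[k]) ? t[k][i] : t[k + 1][i];
}
```

Comparing the output against the brute-force loop on a block of distinct byte values exercises every lane of all four tables.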
Ber*_*ann 11
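I'm coming a little late to the party, and it seems the community has already settled on poseur's pshufb answer, but with 2000 reputation being handed out, which is extremely generous, I have to give it a try.
Here's my version without platform-specific intrinsics or machine-specific asm. I have included some cross-platform timing code that shows a 4x speedup if you do both the bit-twiddling like me AND activate compiler optimization (register optimization, loop unrolling):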
#include "stdlib.h"
#include "stdio.h"
#include "time.h"
#define UInt8 unsigned char
#define IMAGESIZE (1920*1080) 
int main() {
    time_t  t0, t1;
    int frames;
    int frame; 
    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;
    ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
    if(!orig) {printf("nomem1");}
    BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
    if(!dest) {printf("nomem2");}
    printf("to start original hit a key\n");
    getchar(); // standard C; getch() is only available through conio.h
    t0 = time(0);
    frames = 1200;
    for(frame = 0; frame<frames; frame++) {
        int x; for(x = 0; x < IMAGESIZE; x++) {
            dest[x].Red = orig[x].Red;
            dest[x].Green = orig[x].Green;
            dest[x].Blue = orig[x].Blue;
        }
    }
    t1 = time(0);
    printf("finished original of %u frames in %u seconds\n", frames, t1-t0);
    // on my core 2 subnotebook the original took 16 sec 
    // (8 sec with compiler optimization -O3) so at 60 FPS 
    // (instead of the 1200) this would be faster than realtime 
    // (if you disregard any other rendering you have to do). 
    // However if you either want to do other/more processing 
    // OR want faster than realtime processing for e.g. a video-conversion 
    // program then this would have to be a lot faster still.
    printf("to start alternative hit a key\n");
    getchar(); // standard C; getch() is only available through conio.h
    t0 = time(0);
    frames = 1200;
    unsigned int* reader;
    unsigned int* end = (unsigned int*)orig + IMAGESIZE; // one 32-bit word per ARGB pixel
    unsigned int p0, p1, p2, p3; // your question guarantees 32 bit cpu
    unsigned int* writer;
    for(frame = 0; frame<frames; frame++) {
        reader = (void*)orig;
        writer = (void*)dest;
        while(reader<end) {
            // read 4 ARGB pixels; on little-endian x86 each word holds
            // A in bits 0-7, R in bits 8-15, G in bits 16-23, B in bits 24-31
            p0 = reader[0];
            p1 = reader[1];
            p2 = reader[2];
            p3 = reader[3];
            reader += 4;
            // pack them into 3 words of BGR bytes (lowest byte first)
            // word 0 = B0 G0 R0 B1
            writer[0] = (p0>>24) | ((p0>>8)&0xFF00) | ((p0<<8)&0xFF0000) | (p1&0xFF000000);
            // word 1 = G1 R1 B2 G2
            writer[1] = ((p1>>16)&0xFF) | (p1&0xFF00) | ((p2>>8)&0xFF0000) | ((p2<<8)&0xFF000000);
            // word 2 = R2 B3 G3 R3
            writer[2] = ((p2>>8)&0xFF) | ((p3>>16)&0xFF00) | (p3&0xFF0000) | ((p3<<16)&0xFF000000);
            writer += 3;
        }
    }
    t1 = time(0);
    printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);
    // on my core 2 subnotebook this alternative took 10 sec 
    // (4 sec with compiler optimization -O3)
}
The results are these (on my core 2 subnotebook):
F:\>gcc b.c -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds
F:\>gcc b.c -O3 -o b.exe
F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds
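You want to use a Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It also works in JavaScript. This post, however, is a bit funny to read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff's device with 512 Kbytes of moves.
This assembly function should do it, though I don't know whether you want to keep the old data; this function overwrites it.
The code is for MinGW GCC with Intel assembly syntax; you will have to modify it to suit your compiler/assembler.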
extern "C" {
    int convertARGBtoBGR(uint buffer, uint size);
    // Comments are kept outside the strings: in GNU as for x86, ';' separates
    // statements rather than starting a comment.
    __asm(
        ".globl _convertARGBtoBGR\n"
        "_convertARGBtoBGR:\n"
        "  push ebp\n"
        "  mov ebp, esp\n"
        "  sub esp, 4\n"
        "  push esi\n"       // esi and edi are callee-saved, so preserve them
        "  push edi\n"
        "  mov esi, [ebp + 8]\n"
        "  mov edi, esi\n"
        "  mov ecx, [ebp + 12]\n"
        "  cld\n"
        "  convertARGBtoBGR_loop:\n"
        "    lodsd\n"        // load 4 bytes from [esi] into eax, increment esi by 4
        "    bswap eax\n"    // swap eax ( A R G B ) to ( B G R A )
        "    stosd\n"        // store 4 bytes to [edi], increment edi by 4
        "    sub edi, 1\n"   // move edi back by 1 so the next store overwrites the A byte
        "    loop convertARGBtoBGR_loop\n"
        "  pop edi\n"
        "  pop esi\n"
        "  leave\n"
        "  ret\n"
    );
}
You should call it like so:
convertARGBtoBGR( &buffer, IMAGESIZE );
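This function accesses memory only twice per pixel/packet (1 read, 1 write), compared with your brute force method, which has (at least, assuming it was compiled to use registers) 3 read and 3 write operations. The method is the same, but the implementation makes it more efficient.
In combination with one of the fast conversion functions here, and given access to Core 2s, it might be wise to split the conversion into threads that each work on, say, a fourth of the data, as in this pseudocode: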
void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
       thread threads[] = {
           create_thread(bgrFromArgb, dest, src, n/4),
           create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
           create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
           create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
       }
       join_threads(threads);
}
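If n in the pseudocode counts pixels, the source and destination offsets differ (4 bytes per pixel in, 3 bytes per pixel out), so the slice boundaries need a little care. Here is a minimal sketch of just that partitioning arithmetic, leaving thread creation to whatever API is in use. The Slice type and make_slices function are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src;  /* first ARGB byte of this slice */
    uint8_t *dest;       /* first BGR byte of this slice */
    size_t pixels;       /* number of pixels in this slice */
} Slice;

/* Split n pixels into `parts` slices. Every slice size is rounded down to a
   multiple of 4 pixels, and the last slice absorbs whatever remains. */
static void make_slices(const uint8_t *src, uint8_t *dest, size_t n,
                        size_t parts, Slice *out) {
    size_t per = (n / parts) & ~(size_t)3;
    size_t i;
    for (i = 0; i < parts; i++) {
        size_t first = i * per;
        out[i].src = src + 4 * first;    /* 4 bytes per ARGB pixel */
        out[i].dest = dest + 3 * first;  /* 3 bytes per BGR pixel */
        out[i].pixels = (i + 1 == parts) ? n - first : per;
    }
}
```

Each worker would then run the conversion routine on its own slice. A routine that stores past its last pixel, as the 16-byte pshufb store does, would touch the start of the neighbouring slice, so the end of each slice needs a store that stays inside its own bounds.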