将0x1234转换为0x11223344

exe*_*ook 48 c c++ bit-manipulation

如何以高性能方式将十六进制数0x1234扩展为0x11223344?

unsigned int c = 0x1234, b;
b = (c & 0xff) << 4 | c & 0xf | (c & 0xff0) << 8
        | (c & 0xff00) << 12 | (c & 0xf000) << 16;
printf("%p -> %p\n", c, b);
Run Code Online (Sandbox Code Playgroud)

输出:

0x1234 -> 0x11223344
Run Code Online (Sandbox Code Playgroud)

我需要这个用于颜色转换.用户以0xARGB的形式提供数据,我需要将其转换为0xAARRGGBB.是的,可能有数百万,因为每个都可能是一个像素.1000x1000像素等于一百万.

实际情况更复杂,因为单个32位值包含前景色和背景色.所以0xARGBargb成为:[ 0xAARRGGBB, 0xaarrggbb ]

哦,是的,还有一件事,在实际的应用程序中我也否定了alpha,因为在OpenGL中 0xFF是非透明的,0x00是最透明的,这在大多数情况下是不方便的,因为通常你只需要一个RGB部分并且假设透明度是非本.

Apr*_*ori 53

这可以使用SSE2完成,如下所示:

void ExpandSSE2(unsigned __int64 in, unsigned __int64 &outLo, unsigned __int64 &outHi) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F);
  __m128i const mul0 = _mm_set1_epi16(0x0011);
  __m128i const mul1 = _mm_set1_epi16(0x1000);
  __m128i       v;

  v = _mm_cvtsi64_si128(in); // Move the 64-bit value to a 128-bit register
  v = _mm_unpacklo_epi8(v, v);  // 0x12   -> 0x1212
  v = _mm_and_si128(v, mask);   // 0x1212 -> 0x1002
  v = _mm_mullo_epi16(v, mul0); // 0x1002 -> 0x1022
  v = _mm_mulhi_epu16(v, mul1); // 0x1022 -> 0x0102
  v = _mm_mullo_epi16(v, mul0); // 0x0102 -> 0x1122

  outLo = _mm_extract_epi64(v, 0);
  outHi = _mm_extract_epi64(v, 1);
}
Run Code Online (Sandbox Code Playgroud)

当然,您希望将函数的内容放在内部循环中并拉出常量.您还需要跳过x64寄存器并将值直接加载到128位SSE寄存器中.有关如何执行此操作的示例,请参阅下面的性能测试中的SSE2实现.

其核心是五条指令,一次对四个颜色值执行操作.因此,每个颜色值只有大约1.25条指令.还应注意,SSE2可在x64可用的任何地方使用.

Performance tests for an assortment of the solutions here A few people have mentioned that the only way to know what's faster is to run the code, and this is unarguably true. So I've compiled a few of the solutions into a performance test so we can compare apples to apples. I chose solutions which I felt were significantly different from the others enough to require testing. All the solutions read from memory, operate on the data, and write back to memory. In practice some of the SSE solutions will require additional care around the alignment and handling cases when there aren't another full 16 bytes to process in the input data. The code I tested is x64 compiled under release using Visual Studio 2013 running on a 4+ GHz Core i7.

Here are my results:

ExpandOrig:               56.234 seconds  // From asker's original question
ExpandSmallLUT:           30.209 seconds  // From Dmitry's answer
ExpandLookupSmallOneLUT:  33.689 seconds  // from Dmitry's answer
ExpandLookupLarge:        51.312 seconds  // A straightforward lookup table
ExpandAShelly:            43.829 seconds  // From AShelly's answer
ExpandAShellyMulOp:       43.580 seconds  // AShelly's answer with an optimization
ExpandSSE4:               17.854 seconds  // My original SSE4 answer
ExpandSSE4Unroll:         17.405 seconds  // My original SSE4 answer with loop unrolling
ExpandSSE2:               17.281 seconds  // My current SSE2 answer
ExpandSSE2Unroll:         17.152 seconds  // My current SSE2 answer with loop unrolling
Run Code Online (Sandbox Code Playgroud)

In the test results above you'll see I included the asker's code, three lookup table implementations including the small lookup table implementation proposed in Dmitry's answer. AShelly's solution is included too, as well as a version with an optimization I made (an operation can be eliminated). I included my original SSE4 implementation, as well as a superior SSE2 version I made later (now reflected as the answer), as well as unrolled versions of both since they were the fastest here, and I wanted to see how much unrolling sped them up. I also included an SSE4 implementation of AShelly's answer.

So far I have to declare myself the winner. But the source is below, so anyone can test it out on their platform, and include their own solution into the testing to see if they've made a solution that's even faster.

#define DATA_SIZE_IN  ((unsigned)(1024 * 1024 * 128))
#define DATA_SIZE_OUT ((unsigned)(2 * DATA_SIZE_IN))
#define RERUN_COUNT   500

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <utility>
#include <emmintrin.h> // SSE2
#include <tmmintrin.h> // SSSE3
#include <smmintrin.h> // SSE4

void ExpandOrig(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // Read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // Do computation
    u  =   (u & 0x00FF) << 4
         | (u & 0x000F)
         | (u & 0x0FF0) << 8
         | (u & 0xFF00) << 12
         | (u & 0xF000) << 16;
    v  =   (v & 0x00FF) << 4
         | (v & 0x000F)
         | (v & 0x0FF0) << 8
         | (v & 0xFF00) << 12
         | (v & 0xF000) << 16;

    // Store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

unsigned LutLo[256],
         LutHi[256];
void MakeLutLo(void) {
  for (unsigned i = 0, x; i < 256; ++i) {
    x        = i;
    x        = ((x & 0xF0) << 4) | (x & 0x0F);
    x       |= (x << 4);
    LutLo[i] = x;
  }
}
void MakeLutHi(void) {
  for (unsigned i = 0, x; i < 256; ++i) {
    x        = i;
    x        = ((x & 0xF0) << 20) | ((x & 0x0F) << 16);
    x       |= (x << 4);
    LutHi[i] = x;
  }
}

void ExpandLookupSmall(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // Read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // Do computation
    u = LutHi[u >> 8] | LutLo[u & 0xFF];
    v = LutHi[v >> 8] | LutLo[v & 0xFF];

    // Store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandLookupSmallOneLUT(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // Read in data
    u = *(unsigned const*)in;
    v = u >> 16;
    u &= 0x0000FFFF;

    // Do computation
    u = ((LutLo[u >> 8] << 16) | LutLo[u & 0xFF]);
    v = ((LutLo[v >> 8] << 16) | LutLo[v & 0xFF]);

    // Store data
    *(unsigned*)(out) = u;
    *(unsigned*)(out + 4) = v;
    in  += 4;
    out += 8;
  } while (in != past);
}

unsigned LutLarge[256 * 256];
void MakeLutLarge(void) {
  for (unsigned i = 0; i < (256 * 256); ++i)
    LutLarge[i] = LutHi[i >> 8] | LutLo[i & 0xFF];
}

void ExpandLookupLarge(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // Read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // Do computation
    u = LutLarge[u];
    v = LutLarge[v];

    // Store data
    *(unsigned*)(out)      = u;
    *(unsigned*)(out + 4)  = v;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandAShelly(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v, w, x;
  do {
    // Read in data
    u  = *(unsigned const*)in;
    v  = u >> 16;
    u &= 0x0000FFFF;

    // Do computation
    w  = (((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00);
    x  = (((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00);
    w += w * 0x10;
    x += x * 0x10;

    // Store data
    *(unsigned*)(out)      = w;
    *(unsigned*)(out + 4)  = x;
    in                    += 4;
    out                   += 8;
  } while (in != past);
}

void ExpandAShellyMulOp(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  unsigned u, v;
  do {
    // Read in data
    u = *(unsigned const*)in;
    v = u >> 16;
    u &= 0x0000FFFF;

    // Do computation
    u = ((((u & 0xF0F) * 0x101) & 0xF000F) + (((u & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
    v = ((((v & 0xF0F) * 0x101) & 0xF000F) + (((v & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;

    // Store data
    *(unsigned*)(out) = u;
    *(unsigned*)(out + 4) = v;
    in += 4;
    out += 8;
  } while (in != past);
}

void ExpandSSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask0 = _mm_set1_epi16((short)0x8000),
                mask1 = _mm_set1_epi8(0x0F),
                mul = _mm_set1_epi16(0x0011);
  __m128i       u, v, w, x;
  do {
    // Read input into low 8 bytes of u and v
    u = _mm_load_si128((__m128i const*)in);

    v = _mm_unpackhi_epi8(u, u);      // Expand each single byte to two bytes
    u = _mm_unpacklo_epi8(u, u);      // Do it again for v
    w = _mm_srli_epi16(u, 4);         // Copy the value into w and shift it right half a byte
    x = _mm_srli_epi16(v, 4);         // Do it again for v
    u = _mm_blendv_epi8(u, w, mask0); // Select odd bytes from w, and even bytes from v, giving the the desired value in the upper nibble of each byte
    v = _mm_blendv_epi8(v, x, mask0); // Do it again for v
    u = _mm_and_si128(u, mask1);      // Clear the all the upper nibbles
    v = _mm_and_si128(v, mask1);      // Do it again for v
    u = _mm_mullo_epi16(u, mul);      // Multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
    v = _mm_mullo_epi16(v, mul);      // Do it again for v

    // Write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

void ExpandSSE4Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask0  = _mm_set1_epi16((short)0x8000),
                mask1  = _mm_set1_epi8(0x0F),
                mul    = _mm_set1_epi16(0x0011);
  __m128i       u0, v0, w0, x0,
                u1, v1, w1, x1,
                u2, v2, w2, x2,
                u3, v3, w3, x3;
  do {
    // Read input into low 8 bytes of u and v
    u0 = _mm_load_si128((__m128i const*)(in     ));
    u1 = _mm_load_si128((__m128i const*)(in + 16));
    u2 = _mm_load_si128((__m128i const*)(in + 32));
    u3 = _mm_load_si128((__m128i const*)(in + 48));

    v0 = _mm_unpackhi_epi8(u0, u0);      // Expand each single byte to two bytes
    u0 = _mm_unpacklo_epi8(u0, u0);      // Do it again for v
    v1 = _mm_unpackhi_epi8(u1, u1);      // Do it again
    u1 = _mm_unpacklo_epi8(u1, u1);      // Again for u1
    v2 = _mm_unpackhi_epi8(u2, u2);      // Again for v1
    u2 = _mm_unpacklo_epi8(u2, u2);      // Again for u2
    v3 = _mm_unpackhi_epi8(u3, u3);      // Again for v2
    u3 = _mm_unpacklo_epi8(u3, u3);      // Again for u3
    w0 = _mm_srli_epi16(u0, 4);          // Copy the value into w and shift it right half a byte
    x0 = _mm_srli_epi16(v0, 4);          // Do it again for v
    w1 = _mm_srli_epi16(u1, 4);          // Again for u1
    x1 = _mm_srli_epi16(v1, 4);          // Again for v1
    w2 = _mm_srli_epi16(u2, 4);          // Again for u2
    x2 = _mm_srli_epi16(v2, 4);          // Again for v2
    w3 = _mm_srli_epi16(u3, 4);          // Again for u3
    x3 = _mm_srli_epi16(v3, 4);          // Again for v3
    u0 = _mm_blendv_epi8(u0, w0, mask0); // Select even bytes from w, and odd bytes from v, giving the the desired value in the upper nibble of each byte
    v0 = _mm_blendv_epi8(v0, x0, mask0); // Do it again for v
    u1 = _mm_blendv_epi8(u1, w1, mask0); // Again for u1
    v1 = _mm_blendv_epi8(v1, x1, mask0); // Again for v1
    u2 = _mm_blendv_epi8(u2, w2, mask0); // Again for u2
    v2 = _mm_blendv_epi8(v2, x2, mask0); // Again for v2
    u3 = _mm_blendv_epi8(u3, w3, mask0); // Again for u3
    v3 = _mm_blendv_epi8(v3, x3, mask0); // Again for v3
    u0 = _mm_and_si128(u0, mask1);       // Clear the all the upper nibbles
    v0 = _mm_and_si128(v0, mask1);       // Do it again for v
    u1 = _mm_and_si128(u1, mask1);       // Again for u1
    v1 = _mm_and_si128(v1, mask1);       // Again for v1
    u2 = _mm_and_si128(u2, mask1);       // Again for u2
    v2 = _mm_and_si128(v2, mask1);       // Again for v2
    u3 = _mm_and_si128(u3, mask1);       // Again for u3
    v3 = _mm_and_si128(v3, mask1);       // Again for v3
    u0 = _mm_mullo_epi16(u0, mul);       // Multiply each 16-bit value by 0x0011 to duplicate the lower nibble in the upper nibble of each byte
    v0 = _mm_mullo_epi16(v0, mul);       // Do it again for v
    u1 = _mm_mullo_epi16(u1, mul);       // Again for u1
    v1 = _mm_mullo_epi16(v1, mul);       // Again for v1
    u2 = _mm_mullo_epi16(u2, mul);       // Again for u2
    v2 = _mm_mullo_epi16(v2, mul);       // Again for v2
    u3 = _mm_mullo_epi16(u3, mul);       // Again for u3
    v3 = _mm_mullo_epi16(v3, mul);       // Again for v3

    // Write output
    _mm_store_si128((__m128i*)(out      ), u0);
    _mm_store_si128((__m128i*)(out +  16), v0);
    _mm_store_si128((__m128i*)(out +  32), u1);
    _mm_store_si128((__m128i*)(out +  48), v1);
    _mm_store_si128((__m128i*)(out +  64), u2);
    _mm_store_si128((__m128i*)(out +  80), v2);
    _mm_store_si128((__m128i*)(out +  96), u3);
    _mm_store_si128((__m128i*)(out + 112), v3);
    in  += 64;
    out += 128;
  } while (in != past);
}

void ExpandSSE2(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F),
                mul0 = _mm_set1_epi16(0x0011),
                mul1 = _mm_set1_epi16(0x1000);
  __m128i       u, v;
  do {
    // Read input into low 8 bytes of u and v
    u = _mm_load_si128((__m128i const*)in);

    v = _mm_unpackhi_epi8(u, u);      // Expand each single byte to two bytes
    u = _mm_unpacklo_epi8(u, u);      // Do it again for v

    u = _mm_and_si128(u, mask);
    v = _mm_and_si128(v, mask);
    u = _mm_mullo_epi16(u, mul0);
    v = _mm_mullo_epi16(v, mul0);
    u = _mm_mulhi_epu16(u, mul1);     // This can also be done with a right shift of 4 bits, but this seems to mesure faster
    v = _mm_mulhi_epu16(v, mul1);
    u = _mm_mullo_epi16(u, mul0);
    v = _mm_mullo_epi16(v, mul0);

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

void ExpandSSE2Unroll(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const mask = _mm_set1_epi16((short)0xF00F),
                mul0 = _mm_set1_epi16(0x0011),
                mul1 = _mm_set1_epi16(0x1000);
  __m128i       u0, v0,
                u1, v1;
  do {
    // Read input into low 8 bytes of u and v
    u0 = _mm_load_si128((__m128i const*)(in     ));
    u1 = _mm_load_si128((__m128i const*)(in + 16));

    v0 = _mm_unpackhi_epi8(u0, u0);      // Expand each single byte to two bytes
    u0 = _mm_unpacklo_epi8(u0, u0);      // Do it again for v
    v1 = _mm_unpackhi_epi8(u1, u1);      // Do it again
    u1 = _mm_unpacklo_epi8(u1, u1);      // Again for u1

    u0 = _mm_and_si128(u0, mask);
    v0 = _mm_and_si128(v0, mask);
    u1 = _mm_and_si128(u1, mask);
    v1 = _mm_and_si128(v1, mask);

    u0 = _mm_mullo_epi16(u0, mul0);
    v0 = _mm_mullo_epi16(v0, mul0);
    u1 = _mm_mullo_epi16(u1, mul0);
    v1 = _mm_mullo_epi16(v1, mul0);

    u0 = _mm_mulhi_epu16(u0, mul1);
    v0 = _mm_mulhi_epu16(v0, mul1);
    u1 = _mm_mulhi_epu16(u1, mul1);
    v1 = _mm_mulhi_epu16(v1, mul1);

    u0 = _mm_mullo_epi16(u0, mul0);
    v0 = _mm_mullo_epi16(v0, mul0);
    u1 = _mm_mullo_epi16(u1, mul0);
    v1 = _mm_mullo_epi16(v1, mul0);

    // write output
    _mm_store_si128((__m128i*)(out     ), u0);
    _mm_store_si128((__m128i*)(out + 16), v0);
    _mm_store_si128((__m128i*)(out + 32), u1);
    _mm_store_si128((__m128i*)(out + 48), v1);

    in  += 32;
    out += 64;
  } while (in != past);
}

void ExpandAShellySSE4(unsigned char const *in, unsigned char const *past, unsigned char *out) {
  __m128i const zero      = _mm_setzero_si128(),
                v0F0F     = _mm_set1_epi32(0x0F0F),
                vF0F0     = _mm_set1_epi32(0xF0F0),
                v0101     = _mm_set1_epi32(0x0101),
                v1010     = _mm_set1_epi32(0x1010),
                v000F000F = _mm_set1_epi32(0x000F000F),
                v0F000F00 = _mm_set1_epi32(0x0F000F00),
                v0011 = _mm_set1_epi32(0x0011);
  __m128i       u, v, w, x;
  do {
    // Read in data
    u = _mm_load_si128((__m128i const*)in);
    v = _mm_unpackhi_epi16(u, zero);
    u = _mm_unpacklo_epi16(u, zero);

    // original source: ((((a & 0xF0F) * 0x101) & 0xF000F) + (((a & 0xF0F0) * 0x1010) & 0xF000F00)) * 0x11;
    w = _mm_and_si128(u, v0F0F);
    x = _mm_and_si128(v, v0F0F);
    u = _mm_and_si128(u, vF0F0);
    v = _mm_and_si128(v, vF0F0);
    w = _mm_mullo_epi32(w, v0101); // _mm_mullo_epi32 is what makes this require SSE4 instead of SSE2
    x = _mm_mullo_epi32(x, v0101);
    u = _mm_mullo_epi32(u, v1010);
    v = _mm_mullo_epi32(v, v1010);
    w = _mm_and_si128(w, v000F000F);
    x = _mm_and_si128(x, v000F000F);
    u = _mm_and_si128(u, v0F000F00);
    v = _mm_and_si128(v, v0F000F00);
    u = _mm_add_epi32(u, w);
    v = _mm_add_epi32(v, x);
    u = _mm_mullo_epi32(u, v0011);
    v = _mm_mullo_epi32(v, v0011);

    // write output
    _mm_store_si128((__m128i*)(out     ), u);
    _mm_store_si128((__m128i*)(out + 16), v);
    in  += 16;
    out += 32;
  } while (in != past);
}

int main() {
  unsigned char *const indat   = new unsigned char[DATA_SIZE_IN ],
                *const outdat0 = new unsigned char[DATA_SIZE_OUT],
                *const outdat1 = new unsigned char[DATA_SIZE_OUT],
                *      curout  = outdat0,
                *      lastout = outdat1,
                *      place;
  unsigned             start,
                       stop;

  place = indat + DATA_SIZE_IN - 1;
  do {
    *place = (unsigned char)rand();
  } while (place-- != indat);
  MakeLutLo();
  MakeLutHi();
  MakeLutLarge();

  for (unsigned testcount = 0; testcount < 1000; ++testcount) {
    // Solution posted by the asker
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandOrig(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandOrig:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);

    // Dmitry's small lookup table solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupSmall(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSmallLUT:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // Dmitry's small lookup table solution using only one lookup table
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupSmallOneLUT(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandLookupSmallOneLUT:\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // Large lookup table solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandLookupLarge(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandLookupLarge:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShelly(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShelly:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution optimizing out an addition
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShellyMulOp(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShellyMulOp:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // My SSE4 solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE4(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE4:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // My SSE4 solution unrolled
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE4Unroll(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE4Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // My SSE2 solution
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE2(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE2:\t\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // My SSE2 solution unrolled
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandSSE2Unroll(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandSSE2Unroll:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;

    // AShelly's Interleave bits by Binary Magic Numbers solution implemented using SSE2
    start = clock();
    for (unsigned rerun = 0; rerun < RERUN_COUNT; ++rerun)
      ExpandAShellySSE4(indat, indat + DATA_SIZE_IN, curout);
    stop = clock();
    std::cout << "ExpandAShellySSE4:\t\t" << (((stop - start) / 1000) / 60) << ':' << (((stop - start) / 1000) % 60) << ":." << ((stop - start) % 1000) << std::endl;

    std::swap(curout, lastout);
    if (memcmp(outdat0, outdat1, DATA_SIZE_OUT))
      std::cout << "INCORRECT OUTPUT" << std::endl;
  }

  delete[] indat;
  delete[] outdat0;
  delete[] outdat1;
  return 0;
}
Run Code Online (Sandbox Code Playgroud)

NOTE:

I had an SSE4 implementation here initially. I found a way to implement this using SSE2, which is better because it will run on more platforms. The SSE2 implementation is also faster. So, the solution presented at the top is now the SSE2 implementation and not the SSE4 one. The SSE4 implementation can still be seen in the performance tests or in the edit history.

  • 令人惊叹的工作.这值得更多的赞成 (4认同)
  • 我在我的系统上运行了性能测试(后来的Core 2,64位linux,使用-O3 -march = native的gcc 4.8.1).虽然你的SSE4代码是最快的,但许多位操作解决方案几乎一样快(在几个百分点内)......显然,当启用相关功能和优化时,gcc可以自行地对代码进行矢量化. (3认同)
  • 不错的工作.谢谢你做了测试.如果我从他们那里学到了一件事:我不想写SSE代码,除非我绝对必须拥有最后的性能:)这肯定不容易阅读. (2认同)
  • @AShelly:没有人,特别是如果可移植性是一个问题.有各种各样的SIMD指令集...... SSE仅在x86世界中有效.您必须处理PPC上的AltiVec(VMX)和ARM上的NEON等等.直接与指令集扩展接口在任何架构上都是丑陋的;) (2认同)

Dmi*_*tri 21

我不确定最有效的方式是什么,但这有点短:

#include <stdio.h>

int main()
{
  unsigned x = 0x1234;

  x = (x << 8) | x;
  x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
  x = (x << 4) | x;

  printf("0x1234 -> 0x%08x\n",x);

  return 0;
}
Run Code Online (Sandbox Code Playgroud)

如果您需要按照编辑中的建议重复且非常快速地执行此操作,则可以考虑生成查找表并使用它.以下函数动态分配和初始化这样的表:

unsigned *makeLookupTable(void)
{
  unsigned *tbl = malloc(sizeof(unsigned) * 65536);
  if (!tbl) return NULL;
  int i;
  for (i = 0; i < 65536; i++) {
    unsigned x = i;
    x |= (x << 8);
    x = ((x & 0x00f000f0) << 4) | (x & 0x000f000f);
    x |= (x << 4);

    /* Uncomment next line to invert the high byte as mentioned in the edit. */
    /* x = x ^ 0xff000000; */

    tbl[i] = x;
  }
  return tbl;
}
Run Code Online (Sandbox Code Playgroud)

之后,每次转换都是这样的:

result = lookuptable[input];
Run Code Online (Sandbox Code Playgroud)

..或者可能:

result = lookuptable[input & 0xffff];
Run Code Online (Sandbox Code Playgroud)

或者可以使用更小,更易于缓存的查找表(或对),每个查找高字节和低字节(如注释中的@LưuVĩnhPhúc所述).在这种情况下,表生成代码可能是:

unsigned *makeLookupTableLow(void)
{
  unsigned *tbl = malloc(sizeof(unsigned) * 256);
  if (!tbl) return NULL;
  int i;
  for (i = 0; i < 256; i++) {
    unsigned x = i;
    x = ((x & 0xf0) << 4) | (x & 0x0f);
    x |= (x << 4);
    tbl[i] = x;
  }
  return tbl;
}
Run Code Online (Sandbox Code Playgroud)

......和一个可选的第二个表格:

unsigned *makeLookupTableHigh(void)
{
  unsigned *tbl = malloc(sizeof(unsigned) * 256);
  if (!tbl) return NULL;
  int i;
  for (i = 0; i < 256; i++) {
    unsigned x = i;
    x = ((x & 0xf0) << 20) | ((x & 0x0f) << 16);
    x |= (x << 4);

    /* uncomment next line to invert high byte */
    /* x = x ^ 0xff000000; */

    tbl[i] = x;
  }
  return tbl;
}
Run Code Online (Sandbox Code Playgroud)

...并使用两个表转换值:

result = hightable[input >> 8] | lowtable[input & 0xff];
Run Code Online (Sandbox Code Playgroud)

......或者一个(只是上面的低表):

result = (lowtable[input >> 8] << 16) | lowtable[input & 0xff];
result ^= 0xff000000; /* to invert high byte */
Run Code Online (Sandbox Code Playgroud)

如果值(alpha?)的上半部分没有太大变化,即使单个大表也可能表现良好,因为连续查找在表中会更加接近.


我拿着性能测试代码@Apriori贴,做了一些调整,并增加了,他原本不包括其他响应测试......然后编译它的三个版本不同的设置.一个是64位的代码以SSE4.1启用,这里的编译器可以利用上交所的优化......然后两个32位版本,一个与SSE,一个没有.虽然这三个都是在相同的最新处理器上运行,但结果显示最佳解决方案如何根据处理器功能进行更改:

                           64b SSE4.1  32b SSE4.1  32b no SSE
-------------------------- ----------  ----------  ----------
ExpandOrig           time:  3.502 s     3.501 s     6.260 s
ExpandLookupSmall    time:  3.530 s     3.997 s     3.996 s
ExpandLookupLarge    time:  3.434 s     3.419 s     3.427 s
ExpandIsalamon       time:  3.654 s     3.673 s     8.870 s
ExpandIsalamonOpt    time:  3.784 s     3.720 s     8.719 s
ExpandChronoKitsune  time:  3.658 s     3.463 s     6.546 s
ExpandEvgenyKluev    time:  6.790 s     7.697 s    13.383 s
ExpandIammilind      time:  3.485 s     3.498 s     6.436 s
ExpandDmitri         time:  3.457 s     3.477 s     5.461 s
ExpandNitish712      time:  3.574 s     3.800 s     6.789 s
ExpandAdamLiss       time:  3.673 s     5.680 s     6.969 s
ExpandAShelly        time:  3.524 s     4.295 s     5.867 s
ExpandAShellyMulOp   time:  3.527 s     4.295 s     5.852 s
ExpandSSE4           time:  3.428 s
ExpandSSE4Unroll     time:  3.333 s
ExpandSSE2           time:  3.392 s
ExpandSSE2Unroll     time:  3.318 s
ExpandAShellySSE4    time:  3.392 s
Run Code Online (Sandbox Code Playgroud)

可执行文件是在64位Linux上使用gcc 4.8.1编译的-m64 -O3 -march=core2 -msse4.1,分别使用-m32 -O3 -march=core2 -msse4.1-m32 -O3 -march=core2 -mno-sse.@先验的SSE测试省略32位构建(坠毁,机上32位与支持SSE,显然不会与SSE残疾人工作).

其中所做的调整是使用实际的图像数据,而不是随机值,极大地提高了大型查找表的性能,但做了别人相差不大(具有透明背景的物体的照片).

基本上,当SSE不可用(或未使用)时,查找表通过滑坡获胜......并且手动编码的SSE解决方案获胜.但是,值得注意的是,当编译器可以使用SSE进行优化时,大多数位操作解决方案几乎与手动编码的SSE一样快 - 仍然较慢,但只是略有下降.

  • @Lundin设置表是一次性的malloc,它不是转换的一部分. (14认同)
  • @Lundin它只需要在运行时*一次完成*虽然......在程序启动时只需要几分之一秒.静态const表也可以工作,但是你想要编写一个程序来生成给定表大小的源代码. (11认同)
  • 只需256个条目查找表即可.您可以单独查找低字节和高字节.太大的表可能不适合缓存,性能可能更差 (6认同)

ASh*_*lly 12

这是另一次尝试,使用了八个操作:

b = (((c & 0x0F0F) * 0x0101) & 0x00F000F) + 
    (((c & 0xF0F0) * 0x1010) & 0xF000F00);
b += b * 0x10;

printf("%x\n",b); //Shows '0x11223344'
Run Code Online (Sandbox Code Playgroud)

*注意,这篇文章最初包含了完全不同的代码,基于Sean Anderson的bithacks页面中Binary Magic Numbers的Interleave位.但那并不是OP所要求的.所以它已被删除.以下大多数评论都提到了缺失的版本.

  • 建议的代码将复制**8位值的**单位**.OP要求复制16位值的**4位半字节**. (4认同)
  • 那里有一些严重丑陋的代码.无论结果多么有效,看起来都是过早的优化.为了便于阅读,维护和无错误的代码,我认真考虑使用效率较低的代码. (3认同)
  • @Lundin已经调试了代码,只是使用,如果有任何问题,它就在你的代码中.您始终可以在代码中留下注释来描述函数的功能以及从哪里获取功能 (2认同)

tru*_*cks 6

我想将此链接添加到答案池中,因为我认为在讨论优化,记住我们正在运行的硬件以及编译所述平台代码的技术时非常重要.

博客文章使用CPU管道来研究如何优化CPU流水线的一组代码.它实际上显示了一个例子,他试图将数学简化为最少的实际数学运算,但就时间而言,它是最优解的FAR.我在这里看到了几个答案,并且它们可能是正确的,它们可能不是.要知道的唯一方法是实际测量特定代码片段从开始到结束的时间,与其他代码进行比较.阅读此博客; 它非常有趣.

我想我应该提一下,我在这个特殊情况下不会在这里放任何代码,除非我真的尝试了多次尝试,实际上通过多次尝试特别快.


Pap*_*ter 5

我认为Dimitri建议的查找表方法是一个不错的选择,但我建议更进一步,在编译时生成表格; 在编译时完成工作显然会减少执行时间.

首先,我们使用任何建议的方法创建编译时值:

constexpr unsigned int transform1(unsigned int x)
{
  return ((x << 8) | x);
}

constexpr unsigned int transform2(unsigned int x)
{
  return (((x & 0x00f000f0) << 4) | (x & 0x000f000f));
}

constexpr unsigned int transform3(unsigned int x)
{
  return ((x << 4) | x);
}

constexpr unsigned int transform(unsigned int x)
{
  return transform3(transform2(transform1(x)));
}

// Dimitri version, using constexprs
template <unsigned int argb> struct aarrggbb_dimitri
{
  static const unsigned int value = transform(argb);
};

// Adam Liss version
template <unsigned int argb> struct aarrggbb_adamLiss
{
  static const unsigned int value =
    (argb & 0xf000) * 0x11000 +
    (argb & 0x0f00) * 0x01100 +
    (argb & 0x00f0) * 0x00110 +
    (argb & 0x000f) * 0x00011;
};
Run Code Online (Sandbox Code Playgroud)

然后,我们使用我们可用的任何方法创建编译时查找表,我将希望使用C++ 14 整数序列,但我不知道OP将使用哪个编译器.所以另一种可能的方法是使用一个非常难看的宏:

#define EXPAND16(x) aarrggbb<x + 0>::value, \
aarrggbb<x + 1>::value, \
aarrggbb<x + 2>::value, \
aarrggbb<x + 3>::value, \
aarrggbb<x + 4>::value, \
aarrggbb<x + 5>::value, \
aarrggbb<x + 6>::value, \
... and so on

#define EXPAND EXPAND16(0), \
EXPAND16(0x10), \
EXPAND16(0x20), \
EXPAND16(0x30), \
EXPAND16(0x40), \
... and so on

... and so on
Run Code Online (Sandbox Code Playgroud)

在这里演示.

PS:Adam Liss方法可以在没有C++ 11的情况下使用.


Evg*_*uev 5

如果乘法很便宜且可以使用64位算术,则可以使用以下代码:

uint64_t x = 0x1234;
x *= 0x0001000100010001ull;
x &= 0xF0000F0000F0000Full;
x *= 0x0000001001001001ull;
x &= 0xF0F0F0F000000000ull;
x = (x >> 36) * 0x11;
std::cout << std::hex << x << '\n';
Run Code Online (Sandbox Code Playgroud)

事实上,它使用与AShelly原始尝试相同的想法.