SSE _mm_movemask_epi8 equivalent method for ARM NEON

ins*_*rit 5 arm sse neon

I decided to continue optimizing Fast Corners and got stuck on the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
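For reference, _mm_movemask_epi8 packs the most significant bit of each of the 16 input bytes into an integer, with byte i supplying bit i. A minimal scalar sketch of that behaviour in plain C follows; it can serve as a baseline when testing a NEON version. The name movemask_ref is made up for this illustration and is not part of any library.

```c
#include <stdint.h>

/* Scalar model of _mm_movemask_epi8: bit i of the result is the
   most significant bit of byte i. movemask_ref is a hypothetical
   helper name used only for this sketch. */
static int movemask_ref(const uint8_t bytes[16])
{
    int mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= (bytes[i] >> 7) << i;
    }
    return mask;
}
```

Storing a NEON vector to a 16-byte array with vst1q_u8 and passing that array to this function should give the value any of the NEON versions is expected to return.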

Yve*_*ust 6

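I know this post is quite outdated, but I thought it useful to give my (validated) solution. It assumes that every lane of the Input argument is either all ones or all zeros.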

const uint8_t __attribute__ ((aligned (16))) _Powers[16]= 
    { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };

// Set the powers of 2 (do it once for all, if applicable)
uint8x16_t Powers= vld1q_u8(_Powers);

// Compute the mask from the input
uint64x2_t Mask= vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(Input, Powers))));

// Get the resulting bytes
uint16_t Output;
vst1q_lane_u8((uint8_t*)&Output + 0, (uint8x16_t)Mask, 0);
vst1q_lane_u8((uint8_t*)&Output + 1, (uint8x16_t)Mask, 8);

(Mind http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47553, anyway.)

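Similarly to Michael, the trick is to form the powers of two for the indexes of the non-null entries and to sum them pairwise three times. This must be done with increasing data size so that the stride doubles on every addition. You reduce from 2 x 8 8-bit entries to 2 x 4 16-bit, then 2 x 2 32-bit and 2 x 1 64-bit. The low byte of each of these two numbers gives one half of the solution. I don't think there is an easy way to pack them together into a single short value using NEON.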

It takes 6 NEON instructions if the input is in the suitable form and the powers can be preloaded.
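To make the reduction easier to follow, here is a scalar sketch of the same arithmetic in plain C, with each loop standing in for one NEON step. It is only a model for checking the logic on any machine, not a replacement for the intrinsics, and it keeps the assumption that every input byte is 0x00 or 0xFF. The name movemask_paddl_model is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the AND followed by three widening pairwise adds.
   Assumes every input byte is 0x00 or 0xFF.
   movemask_paddl_model is a hypothetical name for this sketch. */
static uint16_t movemask_paddl_model(const uint8_t in[16])
{
    static const uint8_t powers[16] =
        { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
    uint16_t s16[8];
    uint32_t s32[4];
    uint64_t s64[2];

    /* vandq_u8 + vpaddlq_u8: 16 x 8-bit -> 8 x 16-bit */
    for (int i = 0; i < 8; i++) {
        s16[i] = (uint16_t)((in[2 * i] & powers[2 * i]) +
                            (in[2 * i + 1] & powers[2 * i + 1]));
    }
    /* vpaddlq_u16: 8 x 16-bit -> 4 x 32-bit */
    for (int i = 0; i < 4; i++) {
        s32[i] = (uint32_t)s16[2 * i] + s16[2 * i + 1];
    }
    /* vpaddlq_u32: 4 x 32-bit -> 2 x 64-bit */
    for (int i = 0; i < 2; i++) {
        s64[i] = (uint64_t)s32[2 * i] + s32[2 * i + 1];
    }

    /* Each 64-bit sum is at most 255, so its low byte is one half of the mask. */
    return (uint16_t)((s64[0] & 0xFF) | ((s64[1] & 0xFF) << 8));
}
```

Because each half sums eight distinct powers of two, no carry can leave the low byte, which is why reading only byte 0 and byte 8 of the final vector is enough.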


Eas*_*sPi 5

The obvious solution seems to have been completely missed here.

// Use shifts to collect all of the sign bits.
// I'm not sure if this works on big endian, but big endian NEON is very
// rare.
int vmovmaskq_u8(uint8x16_t input)
{
    // Example input (half scale):
    // 0x89 FF 1D C0 00 10 99 33

    // Shift out everything but the sign bits
    // 0x01 01 00 01 00 00 01 00
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));

    // Merge the even lanes together with vsra. The '??' bytes are garbage.
    // vsri could also be used, but it is slightly slower on aarch64.
    // 0x??03 ??02 ??00 ??01
    uint32x4_t paired16 = vreinterpretq_u32_u16(
                              vsraq_n_u16(high_bits, high_bits, 7));
    // Repeat with wider lanes.
    // 0x??????0B ??????04
    uint64x2_t paired32 = vreinterpretq_u64_u32(
                              vsraq_n_u32(paired16, paired16, 14));
    // 0x??????????????4B
    uint8x16_t paired64 = vreinterpretq_u8_u64(
                              vsraq_n_u64(paired32, paired32, 28));
    // Extract the low 8 bits from each lane and join.
    // 0x4B
    return vgetq_lane_u8(paired64, 0) | ((int)vgetq_lane_u8(paired64, 8) << 8);
}
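The byte values in the comments above can be checked without NEON hardware. The sketch below models the three shift-right-and-accumulate steps on one 8-byte half packed into a uint64_t, with lane i stored in byte i (a little-endian lane layout is assumed). The helper names sra16, sra32 and movemask8_sra_model are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Models vsraq_n_u16 on four 16-bit lanes packed in a uint64_t:
   each lane becomes lane + (lane >> n), wrapping within the lane. */
static uint64_t sra16(uint64_t x, int n)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t lane = (uint16_t)(x >> (16 * i));
        lane = (uint16_t)(lane + (lane >> n));
        r |= (uint64_t)lane << (16 * i);
    }
    return r;
}

/* Same idea for two 32-bit lanes (models vsraq_n_u32). */
static uint64_t sra32(uint64_t x, int n)
{
    uint64_t r = 0;
    for (int i = 0; i < 2; i++) {
        uint32_t lane = (uint32_t)(x >> (32 * i));
        lane += lane >> n;
        r |= (uint64_t)lane << (32 * i);
    }
    return r;
}

/* Scalar model of the shift chain for one 8-byte half of the vector. */
static int movemask8_sra_model(const uint8_t in[8])
{
    uint64_t x = 0;
    for (int i = 0; i < 8; i++) {
        x |= (uint64_t)(in[i] >> 7) << (8 * i);   /* vshrq_n_u8(input, 7) */
    }
    x = sra16(x, 7);    /* low byte of each 16-bit lane now holds 2 bits */
    x = sra32(x, 14);   /* low byte of each 32-bit lane now holds 4 bits */
    x += x >> 28;       /* low byte of the 64-bit lane now holds 8 bits */
    return (int)(x & 0xFF);
}
```

Feeding it the example bytes 0x89 FF 1D C0 00 10 99 33 gives 0x4B, matching the final comment in the function.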


ins*_*rit 1

After some tests, it looks like the following code works correctly:

int32_t _mm_movemask_epi8_neon(uint8x16_t input)
{
    const int8_t __attribute__ ((aligned (16))) xr[8] = {-7,-6,-5,-4,-3,-2,-1,0};
    uint8x8_t mask_and = vdup_n_u8(0x80);
    int8x8_t mask_shift = vld1_s8(xr);

    uint8x8_t lo = vget_low_u8(input);
    uint8x8_t hi = vget_high_u8(input);

    lo = vand_u8(lo, mask_and);
    lo = vshl_u8(lo, mask_shift);

    hi = vand_u8(hi, mask_and);
    hi = vshl_u8(hi, mask_shift);

    lo = vpadd_u8(lo,lo);
    lo = vpadd_u8(lo,lo);
    lo = vpadd_u8(lo,lo);

    hi = vpadd_u8(hi,hi);
    hi = vpadd_u8(hi,hi);
    hi = vpadd_u8(hi,hi);

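    // vget_lane_u8 is the portable way to read a lane; subscripting a
    // NEON vector is a compiler extension.
    return (vget_lane_u8(hi, 0) << 8) | vget_lane_u8(lo, 0);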
}
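The reasoning behind this version, as I read it: vshl_u8 with a negative count shifts right, so lane i has its top bit moved down to bit i, and three rounds of vpadd_u8 then sum all eight lanes into lane 0. A scalar sketch of one 8-byte half, in plain C so it can be checked anywhere; padd8 and movemask8_shift_model are hypothetical names used only here.

```c
#include <stdint.h>

/* Models vpadd_u8(a, a): both halves of the result hold the sums of
   adjacent pairs of a. The array is updated in place. */
static void padd8(uint8_t a[8])
{
    uint8_t r[8];
    for (int i = 0; i < 4; i++) {
        r[i] = (uint8_t)(a[2 * i] + a[2 * i + 1]);
        r[i + 4] = r[i];
    }
    for (int i = 0; i < 8; i++) {
        a[i] = r[i];
    }
}

/* Scalar model of the mask, per-lane shift and three pairwise adds. */
static int movemask8_shift_model(const uint8_t in[8])
{
    uint8_t v[8];
    for (int i = 0; i < 8; i++) {
        /* vand_u8 with 0x80, then vshl_u8 by -(7 - i): 0x80 becomes 1 << i */
        v[i] = (uint8_t)((in[i] & 0x80) >> (7 - i));
    }
    padd8(v);
    padd8(v);
    padd8(v);   /* lane 0 now holds the sum of all eight lanes */
    return v[0];
}
```

The full 16-bit result is then the low half's value combined with the high half's value shifted left by 8, as in the return statement of the NEON function.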