有效地收集单个字节,以4的字节跨度分隔

Question

有效地收集单个字节,以4的字节跨度分隔

我正在尝试优化一种算法,该算法将处理可能受益于AVX SIMD指令的大量数据集.不幸的是,输入存储器布局对于所需的计算并不是最佳的.必须通过组合__m256i恰好相隔4个字节的单个字节的值来重新排序信息:

开始编辑

我的目标CPUS不支持AVX2指令,所以像@Elalfer和@PeterCordes指出的那样,我不能使用__m256i值,代码必须转换为使用__m128i值而不是)

结束编辑

内存中的DataSet布局

Byte 0   | Byte 1   | Byte 2   | Byte 3
Byte 4   | Byte 5   | Byte 6   | Byte 7
...
Byte 120 | Byte 121 | Byte 122 | Byte 123
Byte 124 | Byte 125 | Byte 126 | Byte 127

Run Code Online (Sandbox Code Playgroud)

__m256i变量中的期望值:

| Byte 0 | Byte 4 | Byte 8 |     ...     | Byte 120 | Byte 124 |

Run Code Online (Sandbox Code Playgroud)

除了这个简单的代码之外,是否有更有效的方法来收集和重新排列跨步数据？

union {  __m256i   reg;   uint8_t bytes[32]; } aux;
...
for( int i = 0; i < 32; i++ )
    aux.bytes[i] = data[i * 4];

Run Code Online (Sandbox Code Playgroud)

编辑:

我正在尝试优化的步骤是一个位列转换; 换句话说,某列的位(我的数据排列中的32个可能位列)应该成为单个uint32_t值,而其余位则被忽略.

我通过重新排列数据来执行转置,执行左移以将所需的位列作为每个子字节中的最高有效位,最后uint32通过_mm256_movemask_epi8()内部提取并将这些位组合成单个_t值.

Answer 1

Pet*_*des 2

我刚刚注意到编辑，其中有一个特殊情况的答案。

如果您需要对同一数据执行许多不同的位位置，那么您当前的计划是不错的。

如果您只需要 128B 内存中的一位位置（尤其是最高位位置），则可以使用_mm256_movemask_ps从每个 32b 元素获取高位。然后在 GP 寄存器中组合四个 8 位掩码。

一个好的编译器应该优化它：

vmovdqu   ymm0, [buf + 0]
; to select a different bit:
; vpslld  ymm0, ymm0, count   ; count can be imm8 or the low byte of an xmm register
vmovmskps eax, ymm0

vmovdqu   ymm0, [buf + 32]
vmovmskps ebx, ymm0

...  ecx and edx

mov       ah, bl
mov       ch, dl
shl       ecx, 16
or        eax, ecx

Run Code Online (Sandbox Code Playgroud)

仅当您测试高位时这才很好（因此您不需要在之前移动每个向量vmovmsk）。即便如此，这可能比其他解决方案有更多的指令（和代码大小）。

回答原来的问题：

与 Elalfer 的想法类似，但使用 shuffle 单元pack代替指令pshufb。此外，所有 AND 都是独立的，因此它们可以并行执行。Intel CPU 可以同时执行 3 个 AND 运算，但只能执行一次 shuffle。（或者在 Haswell 之前同时进行两次洗牌。）

// without AVX2: you won't really be able to
// do anything with a __m256i, only __m128i
// just convert everything to regular _mm_..., and leave out the final permute

mask = _mm256_set1_epi32(0x000000ff);

// same mask for all, and the load can fold into the AND
// You can write the load separately if you like, it'll still fold
L1 = and(mask, (buf))     // load and zero the bytes we don't want
L2 = and(mask, (buf+32))
L3 = and(mask, (buf+64))
L4 = and(mask, (buf+96))

// squish dwords from 2 concatenated regs down to words in 1 reg
pack12 = _mm256_packus_epi32(L1, L2);
pack34 = _mm256_packus_epi32(L3, L4);

packed = _mm256_packus_epi16(pack12, pack34);  // note the different width: zero-padded-16 -> 8

Vec = permute(packed)  // fix DWORD order in the vector (only needed for 256b version)

Vec = shift(Vec, bit_wanted)
bitvec = movemask(Vec)

    // shift:
    //  I guess word or dword granularity is fine, since byte granularity isn't available.
    //  You only care about the high bit, so it doesn't matter than you're not shifting zeroes into the bottom of each byte.

    // _mm_slli_epi32(Vec, imm8): 1 uop, 1c latency if your count is a compile-time constant.
    // _mm_sll_epi32 (Vec, _mm_cvtsi32_si128(count)): 2uop 2c latency if it's variable.

    // *not* _mm_sllv_epi32(): slower: different shift count for each element.

Run Code Online (Sandbox Code Playgroud)

如果您仅使用 AVX 执行此操作（如您所说），那么您将没有可用的 256b 整数指令。只需构建 128b 向量，并一次获得 16b 的掩模数据。最后你不需要最后的排列。

将掩码与整数指令合并：(m2<<16) | m1. 如果需要，甚至可以通过组合两个 32b 掩码来达到 64b 掩码数据。

性能：这避免了使用 AVX 单独加载指令的需要，因为如果与单寄存器寻址模式一起使用，vpand可以对内存操作数进行微熔丝。

周期1：3vpand条指令。（或者只有 2 个，如果我们正在等待地址，因为只有 2 个装载端口。）
周期2：最后一个或两个vpand，一个pack（L1，L2）
周期 3：下一个pack（L3、L4）
第 4 周期：决赛pack
// 256b AVX2：排列
周期 5：具有 imm8 计数的打包移位：1 uop，1c 延迟。
周期 6：movemask（3 个周期延迟）

延迟 = 8（SnB 及更高版本）

吞吐量：3 个洗牌 (p5)、4 个逻辑 (p015)、1 个移位 (p0)、1 个 pmovmsk (p0)。4 加载微指令。

SnB/IvB：9 ALU uops -> 3c。4 内存读取：2c。
因此，根据您使用掩码执行的操作，需要 3 个累加器来保持执行端口饱和。（天花板（8/3）= 3。）。

变量中的移位计数无法通过编译器内联/展开解析为编译时常量：延迟 = 9。并且移位会为 p1/p5 生成另一个 uop。

对于 Haswell 及更高版本的 AVX2 vpermd，.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/ - 关于内在函数的快速参考。我用它所有的时间。 (2认同)

归档时间：	10 年，3 月前
查看次数：	1031 次
最近记录：	10 年，3 月前