I am trying to use 256-bit vectors (Intel intrinsics - AVX) to improve the performance of my code.
I have a 4th-generation i7 (Haswell architecture) processor that supports SSE1 through SSE4.2 and the AVX/AVX2 extensions.
This is the code snippet I am trying to enhance:
/* code snippet */
kfac1 = kfac + factor; /* 7 cycles for 7 additions */
kfac2 = kfac1 + factor;
kfac3 = kfac2 + factor;
kfac4 = kfac3 + factor;
kfac5 = kfac4 + factor;
kfac6 = kfac5 + factor;
kfac7 = kfac6 + factor;
k1fac1 = k1fac + factor1; /* 7 cycles for 7 additions */
k1fac2 = k1fac1 + factor1;
k1fac3 = k1fac2 + factor1;
k1fac4 = k1fac3 + factor1;
k1fac5 = k1fac4 + factor1;
k1fac6 = k1fac5 + factor1;
k1fac7 = k1fac6 + factor1;
k2fac1 = k2fac + factor2; /* 7 cycles for 7 additions */
k2fac2 = k2fac1 + factor2;
k2fac3 = k2fac2 + factor2;
k2fac4 = k2fac3 + factor2;
k2fac5 = k2fac4 + factor2;
k2fac6 = k2fac5 + factor2;
k2fac7 = k2fac6 + factor2;
/* code snippet */
From the Intel manuals, I found the following:
An integer addition (ADD) takes 1 cycle (latency).
A vector addition of 8 (32-bit) integers also takes 1 cycle.
So I tried doing this:
__m256i fac  = _mm256_set1_epi32(factor);
__m256i fac1 = _mm256_set1_epi32(factor1);
__m256i fac2 = _mm256_set1_epi32(factor2);
__m256i v1 = _mm256_set_epi32(0, kfac6,  kfac5,  kfac4,  kfac3,  kfac2,  kfac1,  kfac);
__m256i v2 = _mm256_set_epi32(0, k1fac6, k1fac5, k1fac4, k1fac3, k1fac2, k1fac1, k1fac);
__m256i v3 = _mm256_set_epi32(0, k2fac6, k2fac5, k2fac4, k2fac3, k2fac2, k2fac1, k2fac);
__m256i res1 = _mm256_add_epi32(v1, fac);   ////////////////////
__m256i res2 = _mm256_add_epi32(v2, fac1);  // just 3 cycles  //
__m256i res3 = _mm256_add_epi32(v3, fac2);  ////////////////////
But the problem is that these values will be used as table indices (table[kfac] ...), so I have to extract them back out as separate integers again. I am wondering whether there is any feasible way to do this?
A smart compiler could get table+factor into a register and use an indexed addressing mode to form the table+factor+k1fac6 address. Check the asm, and if the compiler doesn't do this for you, try changing the source to hand-hold the compiler:
const int *tf = table + factor;
const int *tf2 = table + factor2; // could be lea rdx, [rax+rcx*4] or something.
...
foo = tf[kfac2];
bar = tf2[k2fac6]; // could be mov r12, [rdx + rdi*4]
But to answer the question you asked:
Latency isn't a big deal when you have this many independent additions happening. The throughput of 4 scalar add instructions per clock on Haswell is more relevant.
If k1fac2 and so on were already in contiguous memory, then using SIMD might be worthwhile. Otherwise all the shuffling and data transfer to get them into / out of vector registers is definitely not worth it (i.e. the instructions the compiler emits to implement _mm256_set_epi32(0, kfac6, kfac5, kfac4, kfac3, kfac2, kfac1, kfac)).
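As a minimal sketch of the contiguous-memory case (assuming, hypothetically, that the eight k-values already sit in an int array called kvals[8], which is not the case in the question's code), a single 32-byte load plus one vector add would do all eight additions:

__m256i add_factor(const int *kvals, int factor)
{
    __m256i v   = _mm256_loadu_si256((const __m256i *)kvals);  // one 256-bit load
    __m256i fac = _mm256_set1_epi32(factor);                   // broadcast the factor
    return _mm256_add_epi32(v, fac);                           // 8 additions in one instruction
}

(Requires #include <immintrin.h> and compiling with AVX enabled, e.g. -mavx2.)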
You could avoid needing to get the indices back into integer registers by using an AVX2 gather for the table loads. But gathers are slow on Haswell, so it's probably not worth it. Maybe worth it on Broadwell.
On Skylake, gathers are fast, so it could be good if you can SIMD whatever you do with the LUT results. If you need to extract all the gather results back into separate integer registers, it's probably not worth it.
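As a hedged sketch of what that would look like (the function name and the assumption that table is an int array are illustrative, not from the question's code): with the indices already in a vector, one vpgatherdd loads all eight table entries at once:

__m256i lut_gather(const int *table, __m256i indices)
{
    // scale = 4 because the table holds 4-byte ints (AVX2, vpgatherdd)
    return _mm256_i32gather_epi32(table, indices, 4);
}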
If you really do need to extract 8x 32-bit integers from a __m256i into integer registers, you have three main strategy choices: a vector store to a tmp array followed by scalar reloads; ALU shuffle/extract instructions like pextrd (_mm_extract_epi32), using _mm256_extracti128_si256 to get the high lane into a separate __m128i; or a mix of both. Depending on the surrounding code, any of these three could be optimal on Haswell.
pextrd r32, xmm, imm8 is 2 uops on Haswell, with one of them needing the shuffle unit on port5. That's a lot of shuffle uops, so a pure ALU strategy is only going to be good if your code is bottlenecked on L1d cache throughput. (Not the same thing as memory bandwidth). movd r32, xmm is only 1 uop, and compilers do know to use that when compiling _mm_extract_epi32(vec, 0), but you can also write int foo = _mm_cvtsi128_si32(vec) to make it explicit and remind yourself that the bottom element can be accessed more efficiently.
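As an illustrative sketch (variable names are mine), a pure-ALU extraction of all 8 elements would cost one vextracti128 plus 2x movd and 6x pextrd:

__m128i lo = _mm256_castsi256_si128(vec);        // free: just reinterprets the register
__m128i hi = _mm256_extracti128_si256(vec, 1);   // vextracti128 xmm, ymm, 1
int e0 = _mm_cvtsi128_si32(lo);                  // movd: 1 uop
int e1 = _mm_extract_epi32(lo, 1);               // pextrd: 2 uops each on Haswell
int e2 = _mm_extract_epi32(lo, 2);
int e3 = _mm_extract_epi32(lo, 3);
int e4 = _mm_cvtsi128_si32(hi);
int e5 = _mm_extract_epi32(hi, 1);
int e6 = _mm_extract_epi32(hi, 2);
int e7 = _mm_extract_epi32(hi, 3);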
Store/reload has good throughput. Intel SnB-family CPUs including Haswell can run two loads per clock, and IIRC store-forwarding works from an aligned 32-byte store to any 4-byte element of it. But make sure it's an aligned store, e.g. into _Alignas(32) int tmp[8], or into a union between an __m256i and an int array. You could still store into the int array instead of the __m256i member to avoid union type-punning while still having the array aligned, but it's easiest to just use C++11 alignas or C11 _Alignas.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
...
foo2 = tmp[2];
However, the problem with store/reload is latency. Even the first result won't be ready for 6 cycles after the store-data is ready.
A mixed strategy gives you the best of both worlds: ALU to extract the first 2 or 3 elements lets execution get started on whatever code uses them, hiding the store-forwarding latency of the store/reload.
_Alignas(32) int tmp[8];
_mm256_store_si256((__m256i*)tmp, vec);
__m128i lo = _mm256_castsi256_si128(vec); // This is free, no instructions
int foo0 = _mm_cvtsi128_si32(lo);
int foo1 = _mm_extract_epi32(lo, 1);
foo2 = tmp[2];
// rest of foo3..foo7 also loaded from tmp[]
// Then use foo0..foo7
You might find that it's optimal to do the first 4 elements with pextrd, in which case you only need to store/reload the upper lane. Use vextracti128 [mem], ymm, 1:
_Alignas(16) int tmp[4];
_mm_store_si128((__m128i*)tmp, _mm256_extracti128_si256(vec, 1));
// movd / pextrd for foo0..foo3
int foo4 = tmp[0];
...
With fewer larger elements (e.g. 64-bit integers), a pure ALU strategy is more attractive. 6-cycle vector-store / integer-reload latency is longer than it would take to get all of the results with ALU ops, but store/reload could still be good if there's a lot of instruction-level parallelism and you bottleneck on ALU throughput instead of latency.
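For example (an illustrative sketch, not from the original answer), with 4x 64-bit elements the whole pure-ALU extraction is just one lane extract plus movq/pextrq:

__m128i lo = _mm256_castsi256_si128(v64);        // free cast
__m128i hi = _mm256_extracti128_si256(v64, 1);   // vextracti128
int64_t q0 = _mm_cvtsi128_si64(lo);              // movq r64, xmm
int64_t q1 = _mm_extract_epi64(lo, 1);           // pextrq
int64_t q2 = _mm_cvtsi128_si64(hi);
int64_t q3 = _mm_extract_epi64(hi, 1);

(64-bit mode only; int64_t comes from <stdint.h>.)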
With more smaller elements (8 or 16-bit), store/reload is definitely attractive. Extracting the first 2 to 4 elements with ALU instructions is still good. And maybe even vmovd r32, xmm and then picking that apart with integer shift/mask instructions is good.
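A rough sketch of that last idea for 8-bit elements (variable names are mine): one vmovd moves the low four bytes into an integer register, then plain scalar shifts and masks split them apart:

uint32_t four = (uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(vec));  // vmovd r32, xmm
uint8_t b0 = four         & 0xFF;
uint8_t b1 = (four >> 8)  & 0xFF;
uint8_t b2 = (four >> 16) & 0xFF;
uint8_t b3 = (four >> 24) & 0xFF;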
Your cycle-counting for the vector version is also bogus. The three _mm256_add_epi32 operations are independent, and Haswell can run two vpaddd instructions in parallel. (Skylake can run all three in a single cycle, each with 1 cycle latency.)
Superscalar pipelined out-of-order execution means there's a big difference between latency and throughput, and keeping track of dependency chains matters a lot. See http://agner.org/optimize/, and other links in the x86 tag wiki for more optimization guides.