wyc*_*ter 5 c++ sse x86-64 simd avx
I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires setting the sign of a value to be equal or opposite to the sign of another value. In the serial version, I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the standard library implements it with a compiler built-in, so I can't just copy the code and turn it into SSE/AVX instructions.
I have not tried it yet (so I have no code to show for now), but my simple approach would be to create a register with all values set to -0.0, so that only the sign bit is set. Then I would AND it with the source to find out whether the source's sign bit is set. The result of this operation would be either 0.0 or -0.0, depending on the sign of the source. From that result, I would build a bitmask (using logic operations) and combine it with the target register (using another logic operation) to set the sign accordingly.
However, I am not sure if there isn't a smarter way to solve this. If there is a built-in function for fundamental data types like floats and doubles, maybe there is also an intrinsic that I missed. Any suggestions?
Thanks in advance
EDIT:
Thanks to "chtz" for this useful link:
So basically std::copysign compiles to a sequence of 2 AND operations and a subsequent OR. I will reproduce this for SSE/AVX and post the result here in case somebody else needs it some day :)
EDIT 2:
Here is my working version:
__m128 CopySign(__m128 srcSign, __m128 srcValue)
{
    // Extract the sign bit from srcSign
    const __m128 mask0 = _mm_set1_ps(-0.0f);
    __m128 tmp0 = _mm_and_ps(srcSign, mask0);
    // Clear the sign bit of srcValue, leaving abs(srcValue)
    __m128 tmp1 = _mm_andnot_ps(mask0, srcValue);
    // Merge the sign bit with the magnitude and return
    return _mm_or_ps(tmp0, tmp1);
}
Tested it with:
__m128 a = _mm_setr_ps(1, -1, -1, 1);
__m128 b = _mm_setr_ps(-5, -11, 3, 4);
__m128 c = CopySign(a, b);
for (U32 i = 0; i < 4; ++i)
    std::cout << simd::GetValue(c, i) << std::endl;
The output is as expected:
5
-11
-3
4
However, I also tried the version from the disassembly where
__m128 tmp1 = _mm_andnot_ps(mask0, srcValue);
is replaced with:
const __m128 mask1 = _mm_set1_ps(NAN);
__m128 tmp1 = _mm_and_ps(srcValue, mask1);
The results are quite strange:
4
-8
-3
4
Depending on the chosen numbers, the magnitude is sometimes correct and sometimes not. The sign is always correct. It seems that the bit pattern of NaN is not !(-0.0) for some reason. I remember having similar issues before when I tried to set register values to NaN or to specific bit patterns. Maybe somebody has an idea about the origin of the problem?
EDIT 3:
As "Maxim Egorushkin" clarified in the comments on his answer, my expectation that NaN is !(-0.0) was wrong. NaN does not seem to be a unique bit pattern (see https://steve.hollasch.net/cgindex/coding/ieeefloat.html).
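To make the effect concrete, here is a small scalar sketch that reproduces the odd results. It assumes that the NAN macro expands to the default quiet NaN with the bit pattern 0x7FC00000 (the usual case with gcc and clang on x86, but not something the standard guarantees); `and_bits` is a hypothetical helper written for this example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: bitwise AND of a float with a raw 32-bit mask.
inline float and_bits(float x, std::uint32_t mask) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= mask;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Assumption: bit pattern of the default quiet NaN. It keeps all exponent
// bits but only the TOP mantissa bit, so ANDing with it discards the other
// 22 mantissa bits of the value.
constexpr std::uint32_t quiet_nan_bits = 0x7FC00000u;

// The mask that really is the complement of -0.0f: every bit except the sign.
constexpr std::uint32_t abs_mask = 0x7FFFFFFFu;

// With the NaN pattern as mask:
//   and_bits(-5.0f,  quiet_nan_bits) gives 4.0f  (mantissa bits lost)
//   and_bits(-11.0f, quiet_nan_bits) gives 8.0f  (mantissa bits lost)
//   and_bits(3.0f,   quiet_nan_bits) gives 3.0f  (only the top mantissa bit was set)
// With the proper mask:
//   and_bits(-5.0f,  abs_mask) gives 5.0f
//   and_bits(-11.0f, abs_mask) gives 11.0f
```

This matches the 4, -8, -3, 4 output above once the sign is merged back in. For the SSE version, one way to get the correct magnitude mask is `_mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF))`, although the `_mm_andnot_ps` form in the working version avoids the second constant altogether.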
Thank you all very much!
AVX versions for float and double:
#include <immintrin.h>
__m256 copysign_ps(__m256 from, __m256 to) {
    constexpr float signbit = -0.f;
    auto const avx_signbit = _mm256_broadcast_ss(&signbit);
    // (avx_signbit & from) | (~avx_signbit & to)
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to));
}
__m256d copysign_pd(__m256d from, __m256d to) {
    constexpr double signbit = -0.;
    auto const avx_signbit = _mm256_broadcast_sd(&signbit);
    // (avx_signbit & from) | (~avx_signbit & to)
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to));
}
With AVX2, avx_signbit can be generated without any constants:
__m256 copysign2_ps(__m256 from, __m256 to) {
    auto a = _mm256_castps_si256(from);
    // cmpeq(a, a) yields all ones; shifting left by 31 leaves only the sign bit
    auto avx_signbit = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cmpeq_epi32(a, a), 31));
    // (avx_signbit & from) | (~avx_signbit & to)
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to));
}
__m256d copysign2_pd(__m256d from, __m256d to) {
    auto a = _mm256_castpd_si256(from);
    // cmpeq(a, a) yields all ones; shifting left by 63 leaves only the sign bit
    auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
    // (avx_signbit & from) | (~avx_signbit & to)
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to));
}
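Still, both clang and gcc compute avx_signbit at compile time and replace it with a constant loaded from the .rodata section, which is, IMO, sub-optimal.
If you are targeting icc, here is a version that I think is slightly better than the accepted answer: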
__m256d copysign_pd(__m256d from, __m256d to) {
    __m256d const avx_sigbit = _mm256_set1_pd(-0.);
    return _mm256_or_pd(_mm256_and_pd(avx_sigbit, from), _mm256_andnot_pd(avx_sigbit, to));
}
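It uses _mm256_set1_pd rather than a broadcast intrinsic. On clang and gcc this is mostly a wash, but on icc the broadcast version actually writes a constant to the stack and then broadcasts from it, which is terrible.
Godbolt showing the AVX-512 code; change -march= to -march=skylake to see the AVX2 code.
Here is an untested AVX-512 version that uses vpternlogq directly (through _mm512_ternarylogic_epi64). It compiles down to a single vpternlogd instruction on icc and clang (gcc includes a separate broadcast):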
__m512d copysign_pd_alt(__m512d from, __m512d to) {
    const __m512i sigbit = _mm512_castpd_si512(_mm512_set1_pd(-0.));
    // 0xE4: take each bit from `from` where sigbit is 1, otherwise from `to`
    return _mm512_castsi512_pd(_mm512_ternarylogic_epi64(_mm512_castpd_si512(from), _mm512_castpd_si512(to), sigbit, 0xE4));
}
You can make a 256-bit version of this for when AVX-512 is enabled but you are dealing with __m256* vectors.
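A 256-bit variant could look roughly like the sketch below. Like the 512-bit version, treat it as a sketch rather than a benchmarked implementation. It assumes GCC or Clang on x86-64 and a CPU with AVX-512F and AVX-512VL; the function name `copysign4_pd_avx512vl` and the pointer-based interface are made up for this example, so that the function can be compiled in a translation unit that does not enable AVX-512 globally.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang. The target attribute enables AVX-512F/VL for this
// one function only; callers must check CPU support before calling it.
__attribute__((target("avx512f,avx512vl")))
void copysign4_pd_avx512vl(const double* from, const double* to, double* out) {
    const __m256i sigbit = _mm256_castpd_si256(_mm256_set1_pd(-0.0));
    const __m256i f = _mm256_castpd_si256(_mm256_loadu_pd(from));
    const __m256i t = _mm256_castpd_si256(_mm256_loadu_pd(to));
    // Truth table 0xE4: take each bit from f where sigbit is 1, else from t.
    // Net effect per lane: sign bit from `from`, exponent and mantissa from `to`.
    const __m256i r = _mm256_ternarylogic_epi64(f, t, sigbit, 0xE4);
    _mm256_storeu_pd(out, _mm256_castsi256_pd(r));
}
```

In code that is already built with AVX-512VL enabled, the same body works directly on `__m256d` arguments without the loads and the store; a runtime guard such as `__builtin_cpu_supports("avx512vl")` (a GCC/Clang builtin) is only needed when support is not guaranteed at build time.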