SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

use*_*436 4 sse intrinsics avx avx512

I am looking for an efficient AVX (AVX512) implementation of the following:

// Given
float u[8];
float v[8];

// Compute
float a[8];
float b[8];

//  Such that
for ( int i = 0; i < 8; ++i )
{
    a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
    b[i] = fabs(u[i]) <  fabs(v[i]) ? u[i] : v[i];
}

That is, I need to select element-wise from u and v into a based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Jas*_*n R 5

I had this exact same problem the other day. The solution I came up with (using AVX only) was:

// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b; _mm256_blendv_ps
// returns its second argument where the mask is set, so b swaps the argument order
__m256 a = _mm256_blendv_ps(v, u, u_ge_v);
__m256 b = _mm256_blendv_ps(u, v, u_ge_v);
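To make the selection direction easy to check against the scalar loop from the question, here is a minimal sketch that wraps both versions in functions. The names select_by_abs_scalar and select_by_abs_avx are made up for illustration, and the `__attribute__((target("avx")))` annotation assumes GCC or Clang (with other compilers, drop it and enable AVX through compiler flags instead).

```c
#include <immintrin.h>
#include <math.h>

/* Scalar reference: same logic as the loop in the question. */
static void select_by_abs_scalar(const float *u, const float *v,
                                 float *a, float *b, int n)
{
    for (int i = 0; i < n; ++i) {
        a[i] = fabsf(u[i]) >= fabsf(v[i]) ? u[i] : v[i];
        b[i] = fabsf(u[i]) <  fabsf(v[i]) ? u[i] : v[i];
    }
}

/* AVX version for exactly 8 floats (hypothetical wrapper name).
   Unaligned loads/stores are used so the caller needs no special alignment. */
__attribute__((target("avx")))
static void select_by_abs_avx(const float *u, const float *v,
                              float *a, float *b)
{
    __m256 vu = _mm256_loadu_ps(u);
    __m256 vv = _mm256_loadu_ps(v);

    /* clear the sign bit to get |u| and |v| */
    __m256 sign_bit = _mm256_set1_ps(-0.0f);
    __m256 u_abs = _mm256_andnot_ps(sign_bit, vu);
    __m256 v_abs = _mm256_andnot_ps(sign_bit, vv);

    /* all-ones in every lane where |u| >= |v| */
    __m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);

    /* blendv takes its SECOND argument where the mask is set */
    _mm256_storeu_ps(a, _mm256_blendv_ps(vv, vu, u_ge_v));
    _mm256_storeu_ps(b, _mm256_blendv_ps(vu, vv, u_ge_v));
}
```

Calling both functions on the same 8-element inputs and comparing a and b lane by lane is a quick sanity check. One caveat: for lanes containing NaN the two versions differ, because the comparison is false there, so the vector code puts v in a and u in b while the scalar loop puts v in both.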

An AVX512 equivalent would be:

// take the absolute value of u and v (note: _mm512_andnot_ps requires AVX512DQ)
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b; _mm512_mask_blend_ps
// returns its second vector argument where the mask bit is set, so b swaps the order
__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u);
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v);
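If only AVX512F is available, one possible variation is to take the absolute value with _mm512_abs_ps, since the 512-bit _mm512_andnot_ps belongs to AVX512DQ. The sketch below is an assumption-laden illustration, not a tuned implementation: the function names are hypothetical, and the target attribute and `__builtin_cpu_supports` are GCC/Clang-specific.

```c
#include <immintrin.h>

/* Hypothetical wrapper: processes exactly 16 floats using AVX512F only. */
__attribute__((target("avx512f")))
static void select_by_abs_avx512f(const float *u, const float *v,
                                  float *a, float *b)
{
    __m512 vu = _mm512_loadu_ps(u);
    __m512 vv = _mm512_loadu_ps(v);

    /* |u| and |v| without needing AVX512DQ */
    __m512 u_abs = _mm512_abs_ps(vu);
    __m512 v_abs = _mm512_abs_ps(vv);

    /* one mask bit per lane, set where |u| >= |v| */
    __mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);

    /* mask_blend takes its second vector argument where the mask bit is set */
    _mm512_storeu_ps(a, _mm512_mask_blend_ps(u_ge_v, vv, vu));
    _mm512_storeu_ps(b, _mm512_mask_blend_ps(u_ge_v, vu, vv));
}

/* Runtime check (GCC/Clang builtin) so the function above is only called
   on CPUs that actually support AVX512F. Returns nonzero if supported. */
static int have_avx512f(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f");
}
```

A caller would test have_avx512f() once and fall back to the AVX or scalar path otherwise.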

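As Peter Cordes suggested in the comments above, there are other approaches as well, such as taking the absolute value, followed by a min/max, and then reinserting the sign bit, but I couldn't find anything that was shorter or had lower latency than this sequence of instructions.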