I am looking for an efficient AVX (AVX512) implementation of the following:
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
}
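That is, I need to select each element of a from u and v according to mask, and each element of b according to !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I ran into the same problem the other day. The solution I came up with (using only AVX) is: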
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b;
// _mm256_blendv_ps(x, y, m) returns y where m is set and x elsewhere,
// so a takes u where the mask is set, and b swaps the arguments to invert the selection
__m256 a = _mm256_blendv_ps(v, u, u_ge_v);
__m256 b = _mm256_blendv_ps(u, v, u_ge_v);
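To check the AVX sequence against the scalar loop from the question, it can be wrapped in a function that loads from and stores to plain float arrays. The following is only a sketch: the names select_by_abs_avx and select_by_abs_scalar are hypothetical, unaligned loads and stores are assumed, and the target attribute is a GCC/Clang convenience so the file builds without -mavx.

```c
#include <immintrin.h>
#include <math.h>

/* Hypothetical helper: u, v, a, b each point to 8 floats (no alignment assumed). */
#if defined(__GNUC__) || defined(__clang__)
__attribute__((target("avx")))
#endif
void select_by_abs_avx(const float *u, const float *v, float *a, float *b)
{
    __m256 vu = _mm256_loadu_ps(u);
    __m256 vv = _mm256_loadu_ps(v);

    /* Clear the sign bit to obtain |u| and |v|. */
    __m256 sign_bit = _mm256_set1_ps(-0.0f);
    __m256 u_abs = _mm256_andnot_ps(sign_bit, vu);
    __m256 v_abs = _mm256_andnot_ps(sign_bit, vv);

    /* Lanes are all-ones where |u| >= |v|. */
    __m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);

    /* blendv(x, y, m) returns y where m's sign bit is set, x elsewhere. */
    _mm256_storeu_ps(a, _mm256_blendv_ps(vv, vu, u_ge_v));
    _mm256_storeu_ps(b, _mm256_blendv_ps(vu, vv, u_ge_v));
}

/* Scalar reference, following the loop in the question. */
void select_by_abs_scalar(const float *u, const float *v, float *a, float *b)
{
    for (int i = 0; i < 8; ++i) {
        a[i] = fabsf(u[i]) >= fabsf(v[i]) ? u[i] : v[i];
        b[i] = fabsf(u[i]) <  fabsf(v[i]) ? u[i] : v[i];
    }
}
```

Calling both functions on the same inputs and comparing the outputs lane by lane is a quick sanity check. The two agree for ordinary values, including ties in magnitude. If either input lane is NaN, the scalar loop puts v into both a and b, while the blend version puts v into a and u into b, so NaN inputs need separate handling if that matters.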
The AVX512 equivalent is:
// take the absolute value of u and v (note: _mm512_andnot_ps requires AVX512DQ)
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b;
// _mm512_mask_blend_ps(k, x, y) returns y where k is set and x elsewhere,
// so a takes u where the mask is set, and b swaps the arguments to invert the selection
__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u);
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v);
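As Peter Cordes suggested in the comments above, there are other approaches, such as taking the absolute values, applying min/max, and then re-inserting the sign bits, but I could not find anything that is shorter or has lower latency than this instruction sequence.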