Fastest precise method to convert a vector of integers into floats between 0 and 1

Ser*_*tch 6 c random simd vectorization avx2

Consider randomly generated __m256i vectors. Is there a faster precise way to convert them into __m256 vectors of floats between 0 (inclusive) and 1 (exclusive) than dividing by float(1ull<<32)?

Here's what I have tried so far, where iRand is the input and ans is the output:

const __m256 fRand = _mm256_cvtepi32_ps(iRand);
const __m256 normalized = _mm256_div_ps(fRand, _mm256_set1_ps(float(1ull<<32)));
const __m256 ans = _mm256_add_ps(normalized, _mm256_set1_ps(0.5f));

Soo*_*nts 7

The version below should be faster than your initial version that uses _mm256_div_ps.

vdivps is quite slow, e.g. on my Haswell Xeon it's 18-21 cycles latency, 14 cycles throughput. Newer CPUs do better, BTW: 11/5 on Skylake, 10/6 on Ryzen.

As mentioned in the comments, the performance problem can be fixed by replacing the division with a multiplication, and improved further with FMA. The problem with that approach is distribution quality. If you try to force these numbers into the output interval via rounding mode or clipping, you introduce peaks in the probability distribution of the output numbers.

My implementation isn't ideal either: it doesn't produce all possible values in the output interval, skipping many representable floats, especially near 0. But at least the distribution is very even.

__m256 __vectorcall randomFloats( __m256i randomBits )
{
    // Convert to random float bits
    __m256 result = _mm256_castsi256_ps( randomBits );

    // Zero out exponent bits, leave random bits in mantissa.
    // BTW since the mask value is constexpr, we don't actually need AVX2 instructions for this, it's just easier to code with set1_epi32.
    const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
    result = _mm256_and_ps( result, mantissaMask );

    // Set sign + exponent bits to that of 1.0, which is sign=0, exponent=2^0.
    const __m256 one = _mm256_set1_ps( 1.0f );
    result = _mm256_or_ps( result, one );

    // Subtract 1.0. The above algorithm generates floats in range [1..2).
    // Can't use bit tricks to generate floats in [0..1) because it would cause them to be distributed very unevenly.
    return _mm256_sub_ps( result, one );
}

Update: if you want better precision, use the following version. But it's no longer the "fastest" one.

__m256 __vectorcall randomFloats_32( __m256i randomBits )
{
    // Convert to random float bits
    __m256 result = _mm256_castsi256_ps( randomBits );
    // Zero out exponent bits, leave random bits in mantissa.
    const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
    result = _mm256_and_ps( result, mantissaMask );
    // Set sign + exponent bits to that of 1.0, which is sign=0, exponent = 2^0.
    const __m256 one = _mm256_set1_ps( 1.0f );
    result = _mm256_or_ps( result, one );
    // Subtract 1.0. The above algorithm generates floats in range [1..2).
    result = _mm256_sub_ps( result, one );

    // Use the 9 unused random bits to add extra randomness to the lower bits of the values.
    // This increases precision to 2^-32; however, most floats in the range can't store that many bits,
    // and the fmadd only keeps them for small enough values.

    // If you want uniformly distributed floats with 2^-24 precision, replace the second argument
    // in the following line with _mm256_set1_epi32( 0x80000000 ).
    // In that case you don't need to set the rounding mode bits in MXCSR.
    __m256i extraBits = _mm256_and_si256( randomBits, _mm256_set1_epi32( 0xFF800000 ) );
    extraBits = _mm256_srli_epi32( extraBits, 9 );
    __m256 extra = _mm256_castsi256_ps( extraBits );
    extra = _mm256_or_ps( extra, one );
    extra = _mm256_sub_ps( extra, one );
    _MM_SET_ROUNDING_MODE( _MM_ROUND_DOWN );
    constexpr float mul = 0x1p-23f; // The initial part of the algorithm has generated uniform distribution with the step 2^-23.
    return _mm256_fmadd_ps( extra, _mm256_set1_ps( mul ), result );
}