优化元素2 ^ x-1的乘法

ale*_*cco 2 c simd avx

是否有任何已知的优化,用于乘以已知为2 ^ x-1(1,3,7 ......)的几个(3到5)字节(int8)

这是在使用(2 ^ x-1)/ 2 ^ x多次乘以字节数组的上下文中.除法是微不足道的(为右移添加指数),但分子有点麻烦.

此外,指数x仅在1..31中,并且所有的总和始终小于32.

// In reality there are 16 of these (i.e. a[16], b[16], c[16])
// ( a + b + c ) < 32
char  a = 2;
char  b = 16;
char  c = 8;

// Ratio/scale, there are 16 of these (i.e. r[16])
// It might work storing in log2 and using int8 or int16
// with fixed point approximation
<x?>  r = ( a - 1 ) * ( b - 1 ) * ( c - 1 ) / ( a * b * c );

// Big original value, just one
int   v = 1234567890;
// This might be done by scaling down to log2, too
// it is used for a comparison only
// doesn't need full 32b precission
// This is also 16 values, of course (i.e. rv[16])
int  rv = v * r;
Run Code Online (Sandbox Code Playgroud)

doy*_*nax 5

坦率地说,这个函数不适合AVX指令集,它缺乏整数运算.由SSE2或AVX2提供的直接整数左移几乎肯定是最快的方法.不过,根据您对Aleksander Z.的回答,我收集到的答案是,您希望评估其他方法.

将此问题强加到AVX设备上需要我们使用IEEE-754数字表示来创造.通过未对齐的加载和按位掩码,我们可以将各个字节值混合到32位浮点数的最顶层字节中,其中定义数字的2 ^ n幂的指数位于其中.

这几乎得到了我们所需的幂函数,除了我们缺少指数字段的最低有效位并且需要使用平方根来调整它.同样,我们还需要通过乘法设置指数偏差.

无论如何,请查看下面的代码以获取详细信息,因为在这里逐字逐句地重复评论没什么意义.请注意在阵列之前最多三个字节的未对齐读取(但忽略),因此请根据需要添加填充.还要注意结果字是交错的,result1存储字节{0,4,8,12,...}等等.

哦,显然结果将近似于使用浮点运算的结果.

void compute(const unsigned char (*ptr)[32], size_t len) {
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x3F000000U));
    const __m256 normalize = _mm256_castsi256_ps(_mm256_set1_epi32(0x7F000000U));
    const __m256 offset = _mm256_set1_ps(1);

    __m256 result1 = _mm256_set1_ps(1);
    __m256 result2 = _mm256_set1_ps(1);
    __m256 result3 = _mm256_set1_ps(1);
    __m256 result4 = _mm256_set1_ps(1);

    do {
        // Mask out every forth byte into a separate variable using unaligned
        // loads to simulate 8-to-32 bit integer unpacking
        __m256 real1 = _mm256_loadu_ps((const float *) &ptr[0][-3]); 
        __m256 real2 = _mm256_loadu_ps((const float *) &ptr[0][-2]);
        __m256 real3 = _mm256_loadu_ps((const float *) &ptr[0][-1]);
        __m256 real4 = _mm256_loadu_ps((const float *) &ptr[0][-0]);
        real1 = _mm256_and_ps(real1, mask);
        real2 = _mm256_and_ps(real2, mask);
        real3 = _mm256_and_ps(real3, mask);
        real4 = _mm256_and_ps(real4, mask);
        // The binary values are 2^2x * 2^-BIAS once the masked-once top bytes
        // are interpreted as IEEE-754 floating-point exponent bytes.
        // Unfortunately we are overshooting the exponent field by one bit,
        // hence the doubled exponents. Anyway, let's at least multiply the
        // bias away
        real1 = _mm256_mul_ps(real1, normalize);
        real2 = _mm256_mul_ps(real2, normalize);
        real3 = _mm256_mul_ps(real3, normalize);
        real4 = _mm256_mul_ps(real4, normalize);
        // Use a fast aproximate reciprocal square root to halve the exponent,
        // yielding ~1/2^x.
        // You'd think this case of the reciprocal lookup table would be
        // precise, yet it seems not to be. Perhaps twiddling the rounding
        // mode or biasing the values may make it so.
        real1 = _mm256_rsqrt_ps(real1);
        real2 = _mm256_rsqrt_ps(real2);
        real3 = _mm256_rsqrt_ps(real3);
        real4 = _mm256_rsqrt_ps(real4);
        // Compute (2^x-1)/2^x as 1-1/2^x
        real1 = _mm256_sub_ps(offset, real1);
        real2 = _mm256_sub_ps(offset, real2);
        real3 = _mm256_sub_ps(offset, real3);
        real4 = _mm256_sub_ps(offset, real4);
        // Finally multiply the running products
        result1 = _mm256_mul_ps(result1, real1);
        result2 = _mm256_mul_ps(result2, real2);
        result3 = _mm256_mul_ps(result3, real3);
        result4 = _mm256_mul_ps(result4, real4);
    } while(++ptr, --len);

    /*
     * Do something useful with result1..4 here
     */
}
Run Code Online (Sandbox Code Playgroud)