相关疑难解决方法(0)

我创建了一个使用SIMD进行64位*64位到128位的功能.目前我已经使用SSE2(acutally SSE4.1)实现了它.这意味着它可以同时运行两个64b*64b到128b的产品.同样的想法可以扩展到AVX2或AVX512,同时提供四个或八个64b*64到128b的产品.我的算法基于http://www.hackersdelight.org/hdcodetxt/muldws.c.txt

该算法进行一次无符号乘法,一次有符号乘法和两次有符号*无符号乘法.签名的*signed和unsigned*unsigned操作很容易使用_mm_mul_epi32和_mm_mul_epu32.但混合签名和未签名的产品给我带来了麻烦.例如,考虑一下.

int32_t x = 0x80000000;
uint32_t y = 0x7fffffff;
int64_t z = (int64_t)x*y;

Run Code Online (Sandbox Code Playgroud)

双字产品应该是0xc000000080000000.但是如果你假设你的编译器知道如何处理混合类型,你怎么能得到这个呢？这就是我想出的:

int64_t sign = x<0; sign*=-1;        //get the sign and make it all ones
uint32_t t = abs(x);                 //if x<0 take two's complement again
uint64_t prod = (uint64_t)t*y;       //unsigned product
int64_t z = (prod ^ sign) - sign;    //take two's complement based on the sign

Run Code Online (Sandbox Code Playgroud)

使用SSE可以这样做

__m128i xh;    //(xl2, xh2, xl1, xh1) high is signed, low unsigned
__m128i …

Run Code Online (Sandbox Code Playgroud)

c x86 integer sse bit-manipulation

Z b*_*son

2016 12-28

9
推荐指数

2
解决办法

4203
查看次数

优化快速乘法但缓慢添加:FMA和doubledouble

当我第一次使用Haswell处理器时,我尝试使用FMA来确定Mandelbrot集.主要算法是这样的:

intn = 0;
for(int32_t i=0; i<maxiter; i++) {
    floatn x2 = square(x), y2 = square(y); //square(x) = x*x
    floatn r2 = x2 + y2;
    booln mask = r2<cut; //booln is in the float domain non integer domain
    if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
    n -= mask
    floatn t = x*y; mul2(t); //mul2(t): t*=2
    x = x2 - y2 + cx;
    y = t + cy;
}

Run Code Online (Sandbox Code Playgroud)

这确定n像素是否在Mandelbrot集中.因此对于双浮点,它运行超过4个像素(floatn = __m256d,intn = __m256i).这需要4个SIMD浮点乘法和4个SIMD浮点加法.

然后我修改了这个就像这样使用FMA