Accurate floating-point computation of the sum and difference of two products

nju*_*ffa 3 algorithm floating-point floating-accuracy

The difference of two products and the sum of two products are two primitives that occur in a variety of common computations. diff_of_products (a,b,c,d) := ab - cd and sum_of_products (a,b,c,d) := ab + cd are closely related companion functions that differ only in the signs of some of their operands. Examples of the use of these primitives are:

Complex multiplication of x = (a + i b) and y = (c + i d):

x*y = diff_of_products (a, c, b, d) + i sum_of_products (a, d, b, c)
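
For example, an accurate complex multiplication helper might be sketched as follows; diff_of_products() and sum_of_products() are assumed to be implemented with the semantics defined above (for instance, via the Kahan-based functions in the answer below):

#include <complex.h>

/* assumed to be available with the semantics defined above, e.g. the
   Kahan-based implementations from the answer below */
float diff_of_products (float a, float b, float c, float d);
float sum_of_products  (float a, float b, float c, float d);

/* sketch: accurate complex multiplication (a + i b) * (c + i d) */
float complex cmulf_accurate (float complex x, float complex y)
{
    float a = crealf (x), b = cimagf (x);
    float c = crealf (y), d = cimagf (y);
    return diff_of_products (a, c, b, d) + sum_of_products (a, d, b, c) * I;
}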


Computing the determinant of a 2x2 matrix: diff_of_products (a, d, b, c):

| a  b |
| c  d |
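
As a sketch in code (same assumption about diff_of_products() as above):

/* assumed to be available with the semantics defined above */
float diff_of_products (float a, float b, float c, float d);

/* sketch: determinant of the 2x2 matrix [[a, b], [c, d]] */
float det2x2 (float a, float b, float c, float d)
{
    return diff_of_products (a, d, b, c);   /* a*d - b*c */
}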

In a right triangle, computing the length of one cathetus (leg) given the hypotenuse h and the other cathetus a: sqrt (diff_of_products (h, h, a, a))

Computing the two real solutions of a quadratic equation with positive discriminant:

q = -(b + copysign (sqrt (diff_of_products (b, b, 4*a, c)), b)) / 2
x0 = q / a
x1 = c / q
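
A minimal solver following these formulas might look like the sketch below (diff_of_products() assumed as above, discriminant assumed positive):

#include <math.h>

/* assumed to be available with the semantics defined above */
float diff_of_products (float a, float b, float c, float d);

/* sketch: both real roots of a*x*x + b*x + c = 0, discriminant assumed positive */
void solve_quadratic (float a, float b, float c, float *x0, float *x1)
{
    float disc = diff_of_products (b, b, 4.0f * a, c);    /* b*b - 4*a*c */
    float q = -0.5f * (b + copysignf (sqrtf (disc), b));
    *x0 = q / a;
    *x1 = c / q;
}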


Computing a 3D cross product a = b × c:

a_x = diff_of_products (b_y, c_z, b_z, c_y)
a_y = diff_of_products (b_z, c_x, b_x, c_z)
a_z = diff_of_products (b_x, c_y, b_y, c_x)
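
In code this might look like the following sketch (diff_of_products() assumed as above; the vec3 type is purely illustrative):

/* assumed to be available with the semantics defined above */
float diff_of_products (float a, float b, float c, float d);

typedef struct { float x, y, z; } vec3;   /* illustrative 3-vector type */

/* sketch: 3D cross product a = b x c, one accurate difference per component */
vec3 cross3 (vec3 b, vec3 c)
{
    vec3 a;
    a.x = diff_of_products (b.y, c.z, b.z, c.y);
    a.y = diff_of_products (b.z, c.x, b.x, c.z);
    a.z = diff_of_products (b.x, c.y, b.y, c.x);
    return a;
}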


When computing with IEEE-754 binary floating-point formats, besides the obvious issue of potential overflow and underflow, a naive implementation of either function can suffer from catastrophic cancellation when the two products are similar in magnitude but of opposite sign in the case of sum_of_products(), or of the same sign in the case of diff_of_products().
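
To illustrate with a small self-contained example (the operand values are my own choice): with a = 1 + 2^-22, b = 1 - 2^-22, c = d = 1, the exact value of ab - cd is -2^-44, but in binary32 the product a*b rounds to 1.0f, so the naive difference cancels to zero, while feeding the unrounded product into the final subtraction via fmaf() recovers the exact result:

#include <stdio.h>
#include <math.h>

int main (void)
{
    float a = 1.0f + 0x1.0p-22f, b = 1.0f - 0x1.0p-22f, c = 1.0f, d = 1.0f;
    float naive = a * b - c * d;          /* a*b rounds to 1.0f, so the result is 0 */
    float fused = fmaf (a, b, -(c * d));  /* the unrounded product a*b enters the subtraction */
    printf ("naive = % .8e   fused = % .8e\n", naive, fused);   /* 0 vs. -2^-44 */
    return 0;
}

A single FMA is not a complete fix, however, since c*d is itself rounded before entering the subtraction; accounting for that rounding error is what the question is about.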

Focusing only on the accuracy aspect, how can these functions be implemented robustly in the context of IEEE-754 binary arithmetic? The availability of fused multiply-add operations can be assumed, since this operation is supported by most modern processor architectures and exposed in many programming languages via standard functions. Without loss of generality, the discussion can be restricted to single precision (IEEE-754 binary32) for ease of exposition and testing.

nju*_*ffa 6

The utility of the fused multiply-add (FMA) operation in providing protection against subtractive cancellation stems from the participation of the full, double-width product in the final addition. To my knowledge, the first published accounts of its utility for the accurate and robust computation of the solutions of quadratic equations are two sets of informal notes by the eminent floating-point expert William Kahan:

William Kahan, "Matlab's Loss is Nobody's Gain". August 1998, revised July 2004 (online)
William Kahan, "On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic". November 2004 (online)

Higham's standard work on numerical computing was where I first encountered Kahan's algorithm applied to the computation of the determinant of a 2x2 matrix (p. 65):

Nicholas J. Higham, "Accuracy and Stability of Numerical Algorithms", SIAM 1996

A different algorithm for computing ab+cd, likewise based on FMA, was published by three Intel researchers in the context of Intel's first CPU with FMA support, the Itanium processor (p. 273):

Marius Cornea, John Harrison, and Ping Tak Peter Tang: "Scientific Computing on Itanium-based Systems". Intel Press 2002

In recent years, four papers by French researchers have examined both algorithms in detail, providing error bounds and mathematical proofs. For binary floating-point arithmetic, and provided there is no overflow or underflow in intermediate computations, the maximum relative error of both Kahan's algorithm and the Cornea-Harrison-Tang (CHT) algorithm was shown to be asymptotically twice the unit round-off, that is, 2u. For IEEE-754 binary32 (single precision) this error bound is 2^-23, and for IEEE-754 binary64 (double precision) it is 2^-52.

Furthermore, the error of Kahan's algorithm was shown to be at most 1.5 ulps for binary floating-point arithmetic. I am not aware of an equivalent result in the literature, that is, a proven ulp error bound, for the CHT algorithm; my own experiments with the code below suggest an error bound of 1.25 ulp.

Sylvie Boldo, "Kahan's Algorithm for a Correct Discriminant Computation at Last Formally Proven", IEEE Transactions on Computers, Vol. 58, No. 2, February 2009, pp. 220-225 (online)

Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, October 2013, pp. 2245-2264 (online)

Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea, Harrison and Tang's Method", ACM Transactions on Mathematical Software, Vol. 41, No. 2, Article 7, January 2015 (online)

Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software, Vol. 42, No. 3, Article 19, May 2016 (online)

Kahan's algorithm requires four floating-point operations, two of which are FMAs, whereas the CHT algorithm requires seven floating-point operations, two of which are FMAs. I constructed the test framework below to explore what other trade-offs might exist. It experimentally confirms the bounds from the literature on the relative error of both algorithms and on the ulp error of Kahan's algorithm. My experiments indicate that the CHT algorithm offers the smaller ulp error bound of 1.25 ulp, but that it also produces incorrectly rounded results at roughly twice the rate of Kahan's algorithm.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>

#define TEST_SUM  (0)  // function under test. 0: a*b-c*d; 1: a*b+c*d
#define USE_CHT   (0)  // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang

/*
  Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller,
  "Further Analysis of Kahan's Algorithm for the Accurate Computation
  of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284,
  Oct. 2013, pp. 2245-2264
*/
float diff_of_products_kahan (float a, float b, float c, float d)
{
    float w = d * c;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, -w);
    return f + e;
}

/*
  Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float diff_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b;
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1);
    float e2 = fmaf (c, -d, p2);
    float r = p1 - p2;
    float e = e1 + e2;
    return r + e;
}

/*
  Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea,
  Harrison and Tang's Method", ACM Transactions on Mathematical Software,
  Vol. 41, No. 2, Article 7, (January 2015)
*/
float sum_of_products_kahan (float a, float b, float c, float d)
{
    float w = c * d;
    float e = fmaf (c, -d, w);
    float f = fmaf (a, b, w);
    return f - e;
}

/*
  Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b;
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1);
    float e2 = fmaf (c, d, -p2);
    float r = p1 + p2;
    float e = e1 + e2;
    return r + e;
}

// George Marsaglia's KISS PRNG, used to generate random test operands
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

/* double-double number: y holds the head (high-order part), x the tail */
typedef struct {
    double y;
    double x;
} dbldbl;

dbldbl make_dbldbl (double head, double tail)
{
    dbldbl z;
    z.x = tail;
    z.y = head;
    return z;
}

dbldbl add_dbldbl (dbldbl a, dbldbl b) {
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y + b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) + (b.y - t2);
    t4 = a.x + b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) + (b.x - t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl sub_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y - b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) - (b.y + t2);
    t4 = a.x - b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) - (b.x + t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl mul_dbldbl (dbldbl a, dbldbl b)
{
    dbldbl t, z;
    t.y = a.y * b.y;
    t.x = fma (a.y, b.y, -t.y);
    t.x = fma (a.x, b.x, t.x);
    t.x = fma (a.y, b.x, t.x);
    t.x = fma (a.x, b.y, t.x);
    z.y = t.y + t.x;
    z.x = (t.y - z.y) + t.x;
    return z;
}

/* reference: a*b - c*d computed in double-double arithmetic */
double prod_diff_ref (float a, float b, float c, float d)
{
    dbldbl t = sub_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

/* reference: a*b + c*d computed in double-double arithmetic */
double prod_sum_ref (float a, float b, float c, float d)
{
    dbldbl t = add_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

/* re-interpretation of bit patterns between integer and floating-point types */
float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

static double floatUlpErr (float res, double ref)
{
    uint64_t i, j, err;
    int expoRef;

    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)__float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129, we need to clamp to the maximum "extended float". If reference
       is < 2^-126, we need to denormalize because of float's limited exponent
       range.
    */
    expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
            0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    } else {
        j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void)
{
    const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
    const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
    const uint64_t N = 1ULL << 38;
    double ref, ulp, relerr, maxrelerr = 0, maxulp = 0;
    uint64_t count = 0LL, incorrectly_rounded = 0LL;
    uint32_t ai, bi, ci, di;
    float af, bf, cf, df, resf;

#if TEST_SUM
    printf ("testing a*b+c*d ");
#else
    printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
    printf ("using Cornea-Harrison-Tang algorithm\n");
#else
    printf ("using Kahan algorithm\n");
#endif

    do {
        /* draw random binary32 operands, rejecting values whose products
           could overflow or underflow */
        do {
            ai = KISS;
            af = __uint32_as_float (ai);
        } while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
        do {
            bi = KISS;
            bf = __uint32_as_float (bi);
        } while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
        do {
            ci = KISS;
            cf = __uint32_as_float (ci);
        } while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
        do {
            di = KISS;
            df = __uint32_as_float (di);
        } while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
        count++;
#if TEST_SUM
#if USE_CHT
        resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
        resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
        ulp = floatUlpErr (resf, ref);
        incorrectly_rounded += ulp > 0.5;
        relerr = fabs ((resf - ref) / ref);
        if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
            maxulp = ulp;
            maxrelerr = relerr;
            printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
                    count, incorrectly_rounded, ulp, af, bf, cf, df, resf, ref, relerr);
        }
    } while (count <= N);

    return EXIT_SUCCESS;
}