为什么在“DBL_MAX”和无穷大附近会出现加法错误?

chu*_*ica 6 c floating-point infinity

在试验舍入模式时,FE_DOWNWARD( 和FE_TOWARDZERO) 显然无法形成预期的无穷大之和,而是DBL_MAX在添加DBL_MAX和 1 ULP时形成DBL_MAX

以下是演示意外总和的测试代码。在不同的回合模式下,它会增加DBL_MAX接近 0.5 ULP 和 1.0 ULP 的值。FE_TONEAREST和中没有指出任何问题FE_UPWARD

问题:

  1. 你同意这是一个错误吗?
  2. 代码在另一台机器上是否能形成正确的答案?
  3. 可悲的是,这是最近报告的另一个近乎DBL_MAX问题之后的结果,所以也许我的数学库低于标准。请求就如何报告提供建议。

编译器注释:
调用:Cygwin C 编译器
gcc -std=c11 -O0 -g3 -pedantic -Wall -Wextra -Wconversion -c -fmessage-length=0 -v -MMD -MP -MF"rand_i.d" -MT"rand_i .o" -o "rand_i.o" "../rand_i.c"
COLLECT_GCC=gcc
目标:x86_64-pc-cygwin
gcc 版本 11.3.0 (GCC)
GNU C11 (GCC) 版本 11.3.0 (x86_64-pc- cygwin)
由GNU C版本11.3.0、GMP版本6.2.1、MPFR版本4.1.0、MPC版本1.2.1、isl版本isl-0.25-GMP编译


#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define STRINGIFY(x) #x
#define TOSTRING(x) STRINGIFY(x)

int main() {
  const double max = DBL_MAX;
  const double max_ulp = max - nextafter(max, 0);

  int mode[] = {FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD};
  const char *modes[] = {STRINGIFY(FE_DOWNWARD), STRINGIFY(FE_TONEAREST), //
      STRINGIFY(FE_TOWARDZERO), STRINGIFY(FE_UPWARD)};
  int n = sizeof mode / sizeof mode[0];
  int p = (DBL_MANT_DIG + 2)/4;
  int P = (LDBL_MANT_DIG + 2)/4;
  printf("%s:%d\n", STRINGIFY(FLT_EVAL_METHOD), FLT_EVAL_METHOD);

  printf("LDBL_MAX   :%-24.*La %.*Lg\n", P, LDBL_MAX, LDBL_DECIMAL_DIG, LDBL_MAX);
  printf("DBL_MAX    :%-24.*a %.*g\n", p, max, DBL_DECIMAL_DIG, max);
  printf("DBL_MAX_ULP:%-24.*a %.*g\n", p, max_ulp, DBL_DECIMAL_DIG, max_ulp);
  printf("\n");
  printf("mode:               Addendum                          SUM (double)                 SUM (long double)\n");
  for (int i = 0; i < n; i++) {
    if (fesetround(mode[i])) {
      perror("Invalid mode");
      return -1;
    }
    double delta[] = {nextafter(max_ulp / 2, 0), max_ulp / 2, //
        nextafter(max_ulp / 2, INFINITY), nextafter(max_ulp, 0), //
        max_ulp, nextafter(max_ulp, INFINITY)};
    const char *deltas[] = { "0.5 ulp-", "0.5 ulp", "0.5 ulp+", //
        "ulp-", "ulp", "ulp+"};
    int dn = sizeof delta / sizeof delta[0];
    for (int d = 0; d < dn; d++) {
      double sum = max + delta[d];
      printf("mode:%-14s %-8s:%-24.*a sum:%-24.*a %-24.*La\n", //
          modes[i],  deltas[d], p, delta[d], p, sum, P, 0.0L + max + delta[d]);
    }
    puts("");
  }
}
Run Code Online (Sandbox Code Playgroud)

输出:注意 4条意外的行。

FLT_EVAL_METHOD:0
LDBL_MAX   :0x1.fffffffffffffffep+16383 1.18973149535723176502e+4932
DBL_MAX    :0x1.fffffffffffffp+1023  1.7976931348623157e+308
DBL_MAX_ULP:0x1.0000000000000p+971   1.9958403095347198e+292

mode:               Addendum                          SUM (double)                 SUM (long double)
mode:FE_DOWNWARD    0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023
mode:FE_DOWNWARD    0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD    0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_DOWNWARD    ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023
mode:FE_DOWNWARD    ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--
mode:FE_DOWNWARD    ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--

mode:FE_TONEAREST   0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_TONEAREST   ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024
mode:FE_TONEAREST   ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024
mode:FE_TONEAREST   ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000000p+1024

mode:FE_TOWARDZERO  0.5 ulp-:0x1.fffffffffffffp+969   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff7fep+1023
mode:FE_TOWARDZERO  0.5 ulp :0x1.0000000000000p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO  0.5 ulp+:0x1.0000000000001p+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffff800p+1023
mode:FE_TOWARDZERO  ulp-    :0x1.fffffffffffffp+970   sum:0x1.fffffffffffffp+1023  0x1.fffffffffffffffep+1023
mode:FE_TOWARDZERO  ulp     :0x1.0000000000000p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--
mode:FE_TOWARDZERO  ulp+    :0x1.0000000000001p+971   sum:0x1.fffffffffffffp+1023  0x1.0000000000000000p+1024
                                                      --> inf expected <--

mode:FE_UPWARD      0.5 ulp-:0x1.fffffffffffffp+969   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_UPWARD      0.5 ulp :0x1.0000000000000p+970   sum:inf                      0x1.fffffffffffff800p+1023
mode:FE_UPWARD      0.5 ulp+:0x1.0000000000001p+970   sum:inf                      0x1.fffffffffffff802p+1023
mode:FE_UPWARD      ulp-    :0x1.fffffffffffffp+970   sum:inf                      0x1.0000000000000000p+1024
mode:FE_UPWARD      ulp     :0x1.0000000000000p+971   sum:inf                      0x1.0000000000000000p+1024
mode:FE_UPWARD      ulp+    :0x1.0000000000001p+971   sum:inf                      0x1.0000000000000002p+1024
Run Code Online (Sandbox Code Playgroud)

Sne*_*tel 2

检查和处理溢出分两个阶段进行。请注意,实际的软件/电路可能使用不同的策略,但结果将就像这就是过程一样。

  1. 无限精度的计算结果将进行舍入,规则基于当前舍入模式,但对指数没有限制。在此阶段,不存在“无穷大”这样的东西,但数字可以达到所需的大小。
  2. 如果结果在可表示范围之外,则以基于舍入模式的方式将其“校正”到可表示范围内。对于FE_NEAREST(“正常”模式),数字被修正为无穷大。FE_TOWARDZERO然而,对于,它被更正为+/-DBL_MAX。(对于其他舍入模式,它取决于符号:远离零舍入导致无穷大,向零舍入导致+/-DBL_MAX。)

因此,给定模式的溢出规则让人想起该模式的舍入规则,但实际上并不相同。可以说FE_NEAREST这是一种奇怪的模式,因为在所有超出范围的情况下,它的行为就像(仅限 IEEE-754)远离零的舍入模式,而不是注意到无穷远比得多DBL_MAX。但其他模式的基本行为是+/-DBL_MAX当其首选方向(给定符号)接近零时输出,当其首选方向远离零时输出无穷大。

另请注意,当阶段 1 的结果是可表示范围之外的值时,即使阶段 2 的结果是 ,仍然会发出溢出异常DBL_MAX。溢出并不表示“我做了无穷大”,而是“我必须做第二阶段”。