Using floats or doubles instead of integers

Question

Jay*_*Jay 5 lua floating-accuracy

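I know that the default Lua implementation uses only floating-point numbers, which sidesteps the problem of dynamically determining a number's subtype before choosing which variant of a math function to use.

My question is: if I try to emulate integers as doubles (or floats) in standard C99, is there a reliable (and simple) way to tell what the largest exactly representable value is?

I mean, if I use a 64-bit float to represent integers, I certainly cannot represent every 64-bit integer (the pigeonhole principle applies here). How can I tell what the largest representable integer is?

(Trying to enumerate all the values is not a solution; for example, if I use doubles on a 64-bit architecture, I would have to list 2^64 numbers.)

Thanks!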

Answer 1

Stu*_*ley 12

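For 64-bit doubles, the largest integer up to which every integer is exactly representable is 2^53 (9007199254740992); for 32-bit floats it is 2^24 (16777216). Larger integers can still be stored, but not all of them. For the underlying figures, see the Wikipedia page on IEEE floating point.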
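
Since the question asks about standard C99: the same limits can probably be derived from `<float.h>` without hard-coding the IEEE field sizes. The following is only a sketch. The function names are my own, and it assumes the exponent range is large enough to hold `FLT_RADIX` raised to the number of significand digits, which is true for IEEE 754 types.

```c
#include <float.h>

/* Largest N such that every integer in [0, N] is exactly representable
   in a double: FLT_RADIX raised to the power DBL_MANT_DIG. */
double max_contiguous_int_double(void)
{
    double n = 1.0;
    for (int i = 0; i < DBL_MANT_DIG; ++i)
        n *= FLT_RADIX;   /* exact: n is always a power of the radix */
    return n;
}

/* Same idea for float, using FLT_MANT_DIG. */
float max_contiguous_int_float(void)
{
    float n = 1.0f;
    for (int i = 0; i < FLT_MANT_DIG; ++i)
        n *= FLT_RADIX;   /* exact: n is always a power of the radix */
    return n;
}
```

On an IEEE 754 platform these return 9007199254740992 and 16777216 respectively; in each case the next integer up is the first one that cannot be stored exactly.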

Verifying this in Lua is straightforward:

local maxdouble = 2^53

-- one less than the maximum can be represented precisely
print (string.format("%.0f",maxdouble-1)) --> 9007199254740991
-- the maximum itself can be represented precisely
print (string.format("%.0f",maxdouble))   --> 9007199254740992
-- one more than the maximum gets rounded down
print (string.format("%.0f",maxdouble+1)) --> 9007199254740992 again

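A rough C counterpart of the Lua check is a round-trip test on a single value. This is a sketch under the assumption that `double` is IEEE 754 binary64; the helper name `fits_exactly_in_double` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if v survives a round trip through double unchanged.
   Assumes double is IEEE 754 binary64. */
bool fits_exactly_in_double(uint64_t v)
{
    /* volatile forces the value to be rounded to double width */
    volatile double d = (double)v;

    /* 18446744073709551616.0 is 2^64. A value that rounded up to it
       cannot be converted back to uint64_t (undefined behaviour),
       so it is rejected before the cast. */
    if (d >= 18446744073709551616.0)
        return false;

    return (uint64_t)d == v;
}
```

With this helper, 2^53 - 1 and 2^53 pass and 2^53 + 1 fails, while 2^53 + 2 passes again because even integers in that range are still representable.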

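If we did not have the IEEE-defined field sizes at hand, then, given what we know about how floating-point numbers are designed, we could determine these values with a simple loop over the candidate powers of two: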

#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define min(a, b) ((a) < (b) ? (a) : (b))
#define bits(type) (sizeof(type) * CHAR_BIT)
/* Find the first power of two 2^pow for which 2^pow + 1 does not
   survive a round trip through test_t. */
#define testimax(test_t) { \
  uintmax_t in = 1, out = 2; \
  size_t pow = 0, limit = min(bits(test_t), bits(uintmax_t)); \
  while (pow < limit && out == in + 1) { \
    in = in << 1; \
    /* the cast rounds in + 1 to the precision of test_t */ \
    out = (uintmax_t)(test_t)(in + 1); \
    ++pow; \
  } \
  if (pow == limit) \
    puts(#test_t " is as precise as longest integer type"); \
  else printf(#test_t " conversion imprecise for 2^%zu+1:\n" \
    "   in: %ju\n  out: %ju\n\n", pow, in + 1, out); \
}

int main(void)
{
    testimax(float);
    testimax(double);
    return 0;
}

Output of the code above:

float conversion imprecise for 2^24+1:
   in: 16777217
  out: 16777216

double conversion imprecise for 2^53+1:
   in: 9007199254740993
  out: 9007199254740992

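Of course, because of how floating-point precision works, a 64-bit double can represent numbers far larger than 2^64, since the exponent scales the significand upward; it just cannot represent every integer in that range. The Wikipedia page on double-precision floating point describes this: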

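Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2^53 to 2^54, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2^51 to 2^52, the spacing is 0.5, etc.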
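
The spacing described in that passage can be reproduced from `DBL_EPSILON`, which is the gap between 1.0 and the next double. A minimal sketch, assuming `FLT_RADIX` is 2 and a non-negative exponent; `spacing_from_pow2` is a name made up for this illustration.

```c
#include <float.h>

/* Gap between consecutive doubles in [2^e, 2^(e+1)), for
   0 <= e < DBL_MAX_EXP. Assumes FLT_RADIX == 2, so that
   DBL_EPSILON is the gap just above 1.0. */
double spacing_from_pow2(int e)
{
    double x = 1.0;
    for (int i = 0; i < e; ++i)
        x *= 2.0;            /* exact: x is always a power of two */
    return x * DBL_EPSILON;  /* exact: scaling by a power of two */
}
```

For IEEE 754 doubles this gives 0.5 for e = 51, 1 for e = 52 and 2 for e = 53, matching the ranges quoted above.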

The absolute largest value a double can hold is listed further down that page: 0x7fefffffffffffff, which works out to (1 + (1 - 2^-52)) * 2^1023, or roughly 1.7976931348623157e308.
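
To check that bit pattern against the value the C library reports, one can copy the bits into a double and compare with `DBL_MAX`. This is a sketch that assumes `double` is IEEE 754 binary64 and that integers and doubles share the same byte order; `double_from_bits` is a hypothetical helper.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double.
   Assumes double is IEEE 754 binary64 and has the same byte order
   as uint64_t. memcpy avoids undefined behaviour from pointer casts. */
double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Under those assumptions `double_from_bits(0x7fefffffffffffffULL)` compares equal to `DBL_MAX`, and `double_from_bits(0x4340000000000000ULL)` is 2^53.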
