单精度浮点精度降低精度差

Ver*_*ian 16 c c++ algorithm floating-point matlab

我正在尝试将范围缩减作为实现正弦函数的第一步.

我遵循KC NG的论文"减少巨大论据"中描述的方法

当使用0到20000的输入范围时,我得到的误差大到0.002339146.我的错误显然不应该那么大,我不知道如何减少它.我注意到误差幅度与输入θ幅度与余弦/正弦相关.

我能够获得本文提到的nearpi.c代码,但我不确定如何将代码用于单精度浮点.如果有人有兴趣,可以在以下链接找到nearpi.c文件:nearpi.c

这是我的MATLAB代码:

x = 0:0.1:20000;

% Perform range reduction
% Store constant 2/pi
twooverpi = single(2/pi);

% Compute y
y = (x.*twooverpi);

% Compute k (round to nearest integer
k = round(y);

% Solve for f
f = single(y-k);

% Solve for r
r = single(f*single(pi/2));

% Find last two bits of k
n = bitand(fi(k,1,32,0),fi(3,1,32,0));
n = single(n);

% Preallocate for speed
z(length(x)) = 0;
for i = 1:length(x)

    switch(n(i))
        case 0
            z(i)=sin(r(i));
        case 1
            z(i) = single(cos(r(i)));
        case 2
            z(i) = -sin(r(i));
        case 3
            z(i) = single(-cos(r(i)));
        otherwise
    end

end

maxerror = max(abs(single(z - single(sin(single(x))))))
minerror = min(abs(single(z - single(sin(single(x))))))
Run Code Online (Sandbox Code Playgroud)

我编辑了nearpi.c程序,以便编译.但是我不确定如何解释输出.此外,文件需要输入,我必须手动输入,我也不确定输入的重要性.

这是工作nearpi.c:

/*
 ============================================================================
 Name        : nearpi.c
 Author      : 
 Version     :
 Copyright   : Your copyright notice
 Description : Hello World in C, Ansi-style
 ============================================================================
 */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>


/*
 * Global macro definitions.
 */

# define hex( double )  *(1 + ((long *) &double)), *((long *) &double)
# define sgn(a)         (a >= 0 ? 1 : -1)
# define MAX_k          2500
# define D              56
# define MAX_EXP        127
# define THRESHOLD      2.22e-16

/*
 *  Global Variables
 */

int     CFlength,               /* length of CF including terminator */
        binade;
double  e,
        f;                      /* [e,f] range of D-bit unsigned int of f;
                                   form 1X...X */

// Function Prototypes
int dbleCF (double i[], double j[]);
void input (double i[]);
void nearPiOver2 (double i[]);


/*
 *  This is the start of the main program.
 */

int main (void)
{
    int     k;                  /* subscript variable */
    double  i[MAX_k],
            j[MAX_k];           /* i and j are continued fractions
                                   (coeffs) */


   // fp = fopen("/src/cfpi.txt", "r");


/*
 *  Compute global variables e and f, where
 *
 *      e = 2 ^ (D-1), i.e. the D bit number 10...0
 *  and
 *      f = 2 ^ D - 1, i.e. the D bit number 11...1  .
 */

    e = 1;
    for (k = 2; k <= D; k = k + 1)
        e = 2 * e;
    f = 2 * e - 1;

 /*
  *  Compute the continued fraction for  (2/e)/(pi/2)  , i.e.
  *  q's starting value for the first binade, given the continued
  *  fraction for  pi  as input; set the global variable CFlength
  *  to the length of the resulting continued fraction (including
  *  its negative valued terminator).  One should use as many
  *  partial coefficients of  pi  as necessary to resolve numbers
  *  of the width of the underflow plus the overflow threshold.
  *  A rule of thumb is 0.97 partial coefficients are generated
  *  for every decimal digit of  pi .
  *
  *  Note: for radix B machines, subroutine  input  should compute
  *  the continued fraction for  (B/e)/(pi/2)  where  e = B ^ (D - 1).
  */

    input (i);

/*
 *  Begin main loop over all binades:
 *  For each binade, find the nearest multiples of pi/2 in that binade.
 *
 *  [ Note: for hexadecimal machines ( B = 16 ), the rest of the main
 *  program simplifies(!) to
 *
 *                      B_ade = 1;
 *                      while (B_ade < MAX_EXP)
 *                      {
 *                          dbleCF (i, j);
 *                          dbleCF (j, i);
 *                          dbleCF (i, j);
 *                          CFlength = dbleCF (j, i);
 *                          B_ade = B_ade + 1;
 *                      }
 *                  }
 *
 *  because the alternation of source & destination are no longer necessary. ]
 */

    binade = 1;
    while (binade < MAX_EXP)
    {

/*
 *  For the current (odd) binade, find the nearest multiples of pi/2.
 */

        nearPiOver2 (i);

/*
 *  Double the continued fraction to get to the next (even) binade.
 *  To save copying arrays, i and j will alternate as the source
 *  and destination for the continued fractions.
 */

        CFlength = dbleCF (i, j);
        binade = binade + 1;

/*
 *  Check for main loop termination again because of the
 *  alternation.
 */

        if (binade >= MAX_EXP)
            break;

/*
 *  For the current (even) binade, find the nearest multiples of pi/2.
 */

        nearPiOver2 (j);

/*
 *  Double the continued fraction to get to the next (odd) binade.
 */

        CFlength = dbleCF (j, i);
        binade = binade + 1;
    }

    return 0;
}                               /* end of Main Program */

/*
 *  Subroutine  DbleCF  doubles a continued fraction whose partial
 *  coefficients are i[] into a continued fraction j[], where both
 *  arrays are of a type sufficient to do D-bit integer arithmetic.
 *
 *  In my case ( D = 56 ) , I am forced to treat integers as double
 *  precision reals because my machine does not have integers of
 *  sufficient width to handle D-bit integer arithmetic.
 *
 *  Adapted from a Basic program written by W. Kahan.
 *
 *  Algorithm based on Hurwitz's method of doubling continued
 *  fractions (see Knuth Vol. 3, p.360).
 *
 *  A negative value terminates the last partial quotient.
 *
 *  Note:  for the non-C programmers, the statement  break
 *  exits a loop and the statement  continue  skips to the next
 *  case in the same loop.
 *
 *  The call  modf ( l / 2, &l0 )  assigns the integer portion of
 *  half of L to L0.
 */

int dbleCF (double i[], double j[])
{
      double k,
                    l,
                    l0,
                    j0;
      int    n,
                    m;
    n = 1;
    m = 0;
    j0 = i[0] + i[0];
    l = i[n];
    while (1)
    {
        if (l < 0)
        {
            j[m] = j0;
            break;
        };
        modf (l / 2, &l0);
        l = l - l0 - l0;
        k = i[n + 1];
        if (l0 > 0)
        {
            j[m] = j0;
            j[m + 1] = l0;
            j0 = 0;
            m = m + 2;
        };
        if (l == 0) {
/*
 *  Even case.
 */
            if (k < 0)
            {
                m = m - 1;
                break;
            }
            else
            {
                j0 = j0 + k + k;
                n = n + 2;
                l = i[n];
                continue;
            };
        }
/*
 *  Odd case.
 */
        if (k < 0)
        {
            j[m] = j0 + 2;
            break;
        };
        if (k == 0)
        {
            n = n + 2;
            l = l + i[n];
            continue;
        };
        j[m] = j0 + 1;
        m = m + 1;
        j0 = 1;
        l = k - 1;
        n = n + 1;
        continue;
    };
    m = m + 1;
    j[m] = -99999;
    return (m);
}

/*
 *  Subroutine  input  computes the continued fraction for
 *  (2/e) / (pi/2) , where  e = 2 ^ (D-1) , given  pi 's
 *  continued fraction as input.  That is, double the continued
 *  fraction of  pi   D-3  times and place a zero at the front.
 *
 *  One should use as many partial coefficients of  pi  as
 *  necessary to resolve numbers of the width of the underflow
 *  plus the overflow threshold.  A rule of thumb is  0.97
 *  partial coefficients are generated for every decimal digit
 *  of  pi .  The last coefficient of  pi  is terminated by a
 *  negative number.
 *
 *  I'll be happy to supply anyone with the partial coefficients
 *  of  pi .  My ARPA address is  mcdonald@ucbdali.BERKELEY.ARPA .
 *
 *  I computed the partial coefficients of  pi  using a method of
 *  Bill Gosper's.  I need only compute with integers, albeit
 *  large ones.  After writing the program in  bc  and  Vaxima  ,
 *  Prof. Fateman suggested  FranzLisp .  To my surprise, FranzLisp
 *  ran the fastest!  the reason?   FranzLisp's  Bignum  package is
 *  hand coded in assembler.  Also,  FranzLisp  can be compiled.
 *
 *
 *  Note: for radix B machines, subroutine  input  should compute
 *  the continued fraction for  (B/e)/(pi/2)  where  e = B ^ (D - 1).
 *  In the case of hexadecimal ( B = 16 ), this is done by repeated
 *  doubling the appropriate number of times.
 */

void input (double i[])
{
    int     k;
    double  j[MAX_k];

/*
 *  Read in the partial coefficients of  pi  from a precalculated file
 *  until a negative value is encountered.
 */

    k = -1;
    do
    {
        k = k + 1;
        scanf ("%lE", &i[k]);
        printf("hello\n");
        printf("%d", k);
    } while (i[k] >= 0);

/*
 *  Double the continued fraction for  pi  D-3  times using
 *  i  and  j  alternately as source and destination.  On my
 *  machine  D = 56  so  D-3  is odd; hence the following code:
 *
 *  Double twice  (D-3)/2  times,
 */
    for (k = 1; k <= (D - 3) / 2; k = k + 1)
    {
        dbleCF (i, j);
        dbleCF (j, i);
    };
/*
 *  then double once more.
 */
    dbleCF (i, j);

/*
 *  Now append a zero on the front (reciprocate the continued
 *  fraction) and the return the coefficients in  i .
 */

    i[0] = 0;
    k = -1;
    do
    {
        k = k + 1;
        i[k + 1] = j[k];
    } while (j[k] >= 0);

/*
 *  Return the length of the continued fraction, including its
 *  terminator and initial zero, in the global variable CFlength.
 */

    CFlength = k;
}

/*
 *  Given a continued fraction's coefficients in an array  i ,
 *  subroutine  nearPiOver2  finds all machine representable
 *  values near a integer multiple of  pi/2  in the current binade.
 */

void nearPiOver2 (double i[])
{
    int     k,                  /* subscript for recurrences    (see
                                   handout) */
            K;                  /* like  k , but used during cancel. elim.
                                   */
    double  p[MAX_k],           /* product of the q's           (see
                                   handout) */
            q[MAX_k],           /* successive tail evals of CF  (see
                                   handout) */
            j[MAX_k],           /* like convergent numerators   (see
                                   handout) */
            tmp,                /* temporary used during cancellation
                                   elim. */
            mk0,                /* m[k - 1]                     (see
                                   handout) */
            mk,                 /* m[k] is one of the few ints  (see
                                   handout) */
            mkAbs,              /* absolute value of m sub k
                                */
            mK0,                /* like  mk0 , but used during cancel.
                                   elim. */
            mK,                 /* like  mk  , but used during cancel.
                                   elim. */
            z,                  /* the object of our quest (the argument)
                                */
            m0,                 /* the mantissa of z as a D-bit integer
                                */
            x,                  /* the reduced argument         (see
                                   handout) */
            ldexp (),           /* sys routine to multiply by a power of
                                   two  */
            fabs (),            /* sys routine to compute FP absolute
                                   value   */
            floor (),           /* sys routine to compute greatest int <=
                                   value   */
            ceil ();            /* sys routine to compute least int >=
                                   value   */

 /*
  *  Compute the q's by evaluating the continued fraction from
  *  bottom up.
  *
  *  Start evaluation with a big number in the terminator position.
  */

    q[CFlength] = 1.0 + 30;

    for (k = CFlength - 1; k >= 0; k = k - 1)
        q[k] = i[k] + 1 / q[k + 1];

/*
 *  Let  THRESHOLD  be the biggest  | x |  that we are interesed in
 *  seeing.
 *
 *  Compute the p's and j's by the recurrences from the top down.
 *
 *  Stop when
 *
 *        1                          1
 *      -----   >=  THRESHOLD  >   ------    .
 *      2 |j |                     2 |j  |
 *          k                          k+1
 */

    p[0] = 1;
    j[0] = 0;
    j[1] = 1;
    k = 0;
    do
    {
        p[k + 1] = -q[k + 1] * p[k];
        if (k > 0)
            j[1 + k] = j[k - 1] - i[k] * j[k];
        k = k + 1;
    } while (1 / (2 * fabs (j[k])) >= THRESHOLD);

/*
 *  Then  mk  runs through the integers between
 *
 *                  k        +                   k        +
 *              (-1)  e / p  -  1/2     &    (-1)  f / p  -  1/2  .
 *                         k                            k
 */

    for (mkAbs = floor (e / fabs (p[k]));
            mkAbs <= ceil (f / fabs (p[k])); mkAbs = mkAbs + 1)
    {

        mk = mkAbs * sgn (p[k]);

/*
 *  For each  mk ,  mk0  runs through integers between
 *
 *                    +
 *              m  q  -  p  THRESHOLD  .
 *               k  k     k
 */

        for (mk0 = floor (mk * q[k] - fabs (p[k]) * THRESHOLD);
                mk0 <= ceil (mk * q[k] + fabs (p[k]) * THRESHOLD);
                mk0 = mk0 + 1)
        {

/*
 *  For each pair  { mk ,  mk0 } , check that
 *
 *                             k
 *              m       =  (-1)  ( j   m  - j  m   )
 *               0                  k-1 k    k  k-1
 */
            m0 = (k & 1 ? -1 : 1) * (j[k - 1] * mk - j[k] * mk0);

/*
 *  lies between  e  and  f .
 */
            if (e <= fabs (m0) && fabs (m0) <= f)
            {

/*
 *  If so, then we have found an
 *
 *                              k
 *              x       =  ((-1)  m  / p  - m ) / j
 *                                 0    k    k     k
 *
 *                      =  ( m  q  - m   ) / p  .
 *                            k  k    k-1     k
 *
 *  But this later formula can suffer cancellation.  Therefore,
 *  run the recurrence for the  mk 's  to get  mK  with minimal
 *   | mK | + | mK0 |  in the hope  mK  is  0  .
 */
                K = k;
                mK = mk;
                mK0 = mk0;
                while (fabs (mK) > 0)
                {
                    p[K + 1] = -q[K + 1] * p[K];
                    tmp = mK0 - i[K] * mK;
                    if (fabs (tmp) > fabs (mK0))
                        break;
                    mK0 = mK;
                    mK = tmp;
                    K = K + 1;
                };

/*
 *  Then
 *              x       =  ( m  q  - m   ) / p
 *                            K  K    K-1     K
 *
 *  as accurately as one could hope.
 */
                x = (mK * q[K] - mK0) / p[K];

/*
 *  To return  z  and  m0  as positive numbers,
 *   x  must take the sign of  m0  .
 */
                x = x * sgn (m0);
                m0 = fabs (m0);

/*d
 *  Set  z = m0 * 2 ^ (binade+1-D) .
 */
                z = ldexp (m0, binade + 1 - D);

/*
 *  Print  z (hex),  z (dec),  m0 (dec),  binade+1-D,  x (hex), x (dec).
 */

                printf ("%08lx %08lx    Z=%22.16E    M=%17.17G    L+1-%d=%3d    %08lx %08lx    x=%23.16E\n", hex (z), z, m0, D, binade + 1 - D, hex (x), x);

            }
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

use*_*136 9

理论

首先让我们注意使用单精度算法的差异.

  1. [等式8]最小值f可以更大.由于双精度数是单精度数的超集,因此最接近single倍数的数字2/pi只能比〜更远2.98e-19,因此固定算术表示中的前导零数f必须最多为61个前导零(但可能会更少).表示此数量fdigits.

  2. [9之前的等式]因此,代替121位,y必须精确到fdigits+ 24(单精度中的非零有效位)+7(额外保护位)= fdigits+ 31,并且最多92.

  3. [等式9]"因此,与x指数的宽度一起,2/pi必须包含127(最大指数single)+ 31 + fdigits,或158 + fdigits且最多219位.

  4. [2.5小节]大小Ax二进制点之前的零个数决定(并且不受移动到的影响single),而大小C由9之前的公式确定.

    • 对于大x(x> = 2 ^ 24),x看起来像这样:[24位,M zeros].通过相乘A,其大小是所述第一M比特2/pi,将导致一个整数(的零点x将刚刚转移到一切的整数).
    • 选择CM+d一点开始2/pi将导致产品x*C的大小最多d-24.在双精度中,d选择为174(而不是24,我们有53),因此产品的大小最多为121.在中single,足以选择d这样d-24 <= 92,或者更确切地说,d-24 <= fdigits+31.也就是说,d可以选择为fdigits+55,或最多116.
    • 结果,B应该是最多116位的大小.

因此,我们有两个问题:

  1. 计算fdigits.这涉及从链接的论文中阅读参考文献6并理解它.可能不那么容易.:)据我所知,这是唯一使用的地方nearpi.c.
  2. 计算B,相关的位2/pi.由于M受到127的限制,我们可以计算2/pi离线的前127 + 116位并将它们存储在一个数组中.见维基百科.
  3. 计算y=x*B.这涉及乘以x116位数.这是使用第3节的地方.块的大小选择为24,因为2*24 + 2(乘以两个24位数,并添加3个这样的数)小于double53 的精度(因为24除以96).single出于类似的原因,我们可以使用大小为11位的块进行算术运算.

注意 - 技巧B仅适用于指数为正(x> = 2 ^ 24)的数字.

总结一下 - 首先,你必须double精确地解决问题.你的Matlab代码在double精度上也不起作用(尝试删除single和计算sin(2^53),因为你twooverpi只有53位有效位,而不是175位(而且无论如何,你不能在Matlab中直接乘以这样精确的数字).其次,该方案应该适用于使用single,并且再一次,关键问题正好代表2/pi,并支持高精度数字的乘法.最后,当一切正常时,你可以尝试找出更好的方法fdigits来减少你必须存储和乘法的位数.

希望我没有完全离开 - 欢迎评论和矛盾.

作为一个例子,让我们计算sin(x)x = single(2^24-1)有效位(M= 0)之后没有零的位置.这简化了查找B,B由前116位组成2/pi.由于x具有24位和B116位的精度,因此产品

y = x * B
Run Code Online (Sandbox Code Playgroud)

将根据需要具有92位精度.

链接文章的第3部分描述了如何以足够的精度执行此产品; y在我们的例子中,相同的算法可以与大小为11的块一起使用.作为一个苦差事,我希望我没有明确地这样做,而是依赖于Matlab象征性的数学工具箱.这个工具箱为我们提供了这个vpa函数,它允许我们以十进制数字指定数字的精度.所以,

vpa('2/pi', ceil(116*log10(2)))
Run Code Online (Sandbox Code Playgroud)

将产生2/pi至少116位精度的近似值.因为vpa它的精度参数只接受整数,所以我们通常不能精确地指定数字的二进制精度,所以我们使用次优的.

以下代码sin(x)根据论文计算single精度:

x = single(2^24-1);
y = x *  vpa('2/pi', ceil(116*log10(2)));    % Precision = 103.075
k = round(y);
f = single(y - k);
r = f * single(pi) / 2;
switch mod(k, 4)
    case 0 
        s = sin(r);
    case 1
        s = cos(r);
    case 2
        s = -sin(r);
    case 3
        s = -cos(r);
end
sin(x) - s                                   % Expected value: exactly zero.
Run Code Online (Sandbox Code Playgroud)

(y使用得到的精度Mathematica,结果证明是比Matlab:) 更好的数值工具)

libm

这个问题的另一个答案(之后已被删除)引导我实现libm,尽管它在双精度数字上起作用,但却非常彻底地遵循了链接的论文.

请参阅文件s_sin.c以获取包装器(链接纸中的表2显示为switch文件末尾的语句),e_rem_pio2.c用于参数减少代码(特别感兴趣的是包含前396个十六进制数字的数组)的2/pi,起始于线69).