Min*_*neR 4 c# floating-point ieee-754
我有一个可以简化为以下问题陈述的问题:
给定一系列的双精度数,每个均在范围内
[0, 1e7]
,修改最后一个元素,以使数字的总和恰好等于目标数字。双精度数系列已加到epsilon(1e-7)内的目标数,但不是==。
下列代码有效,但是否可以保证满足第一句中所述要求的所有输入有效?
public static double[] FixIt(double[] input, double targetDouble)
{
var result = new double[input.Length];
if (input.Length == 0) return result;
double sum = 0;
for (int i = 0; i < input.Length - 1; i++)
{
sum += input[i];
result[i] = input[i];
}
double remainder = targetDouble - sum;
result[result.Length - 1] = remainder;
return result;
}
var arr1 = Enumerable.Repeat(Math.PI / 13, 13).ToArray();
var arr2 = FixIt(arr1, Math.PI);
Debug.Print(Math.PI.ToString("R")); //3.1415926535897931
Debug.Print(arr1.Sum().ToString("R")); //3.1415926535897922
Debug.Print(arr2.Sum().ToString("R")); //3.1415926535897931
Run Code Online (Sandbox Code Playgroud)
该问题的先前版本要求修改第一个元素,但是修改最后一个元素会将问题简化为一个已知的和和一个已知的目标,而我们只剩下是否last = target-sum
隐含那个问题sum+last == target
。
(当然,如果没有NaN,则对范围的限制也暗示对范围的限制,last
这可能会有所帮助。)
关于真正的问题:我们已经以多种方式使这个问题浮出水面,但是目前我们正在尝试做的是减少由于线性规划求解器中的数值不稳定性而出现的浮点误差。 (硬币或CBC)。例如,有6个变量都必须在[0,X]范围内,并且变量的总和也必须为X。由于数值的不稳定性,求解器有时会返回略微为负的值以及未求和的值不能精确求和X。我们已经克服了负数问题-现在只是试图解决X问题的总和。(是的,我们更改结果可能会违反一些约束,但是确保将这些数字的总和设为X的优先级更高,而其他约束并不那么重要。)
z = x-y;
这不能保证z+y == x
,而且对于找到z
这样的问题并非总是有解决的办法z+y == x
。证明如下。
我们假设IEEE-754二进制浮点算法具有四舍五入到最接近,至偶数的关系。使用了基本的64位格式,但结果适用于其他格式。请注意,64位格式使用53位有效数字,这意味着只能表示有效位数为53或更少的数字。
Consider a target x
equal to 1+2?52. Let y
be 2?53. Then, after z = x-y;
, z+y == x
evaluates to false. The arithmetic details are shown below, but:
z = x-y;
sets z
to 1, and then z+y
produces 1, which is less than x
.z
to the next representable number, 1+2?52, then z+y
produces 1+2?51, which is greater than x
.z
that makes z+y == x
true.Details:
The mathematical result of x
?y
is 1+2?53. As this has 54 significant bits (from 20 to 2?53), it is not representable, and the computed result of x-y
must be rounded. The two nearest numbers are 1 and 1+2?52. The ties-to-even rule produces the former number, 1, as the low bit of its significand is 0, while the low bit for 1+2?52 is 1.
Thus z = x-y;
sets z
to 1.
Then the mathematical result of z
+y
is 1+2?53. As above, this is rounded to 1, so the computed result of z+y
is 1. So z+y == x
compares 1 to 1+2?52 and produces false.
Furthermore, no value of z
could make the comparison true. If we increment z
by the smallest available step, from 1 to 1+2?52, the mathematical sum of z
+y
is then 1+2?52+2?53. This is midway between the two representable numbers 1+2?52 and 1+2?51. The former has a low bit of 1, and the latter has a low bit of 0, so the computed result of this z+y
is 1+2?51, which is of course not equal to 1+2?52.
Floating-point addition is weakly monotonic, so there are no values of z
that would produce 1+2?52 for z+y
.
归档时间: |
|
查看次数: |
145 次 |
最近记录: |