Floating-point arithmetic error


Suppose you need to generate $n + 1$ equally spaced points on the interval $[a, b]$, with spacing $h = \frac{b-a}{n}$. In floating-point arithmetic, which of the two methods is better: $x_0 = a$, $x_k = x_{k-1} + h$ ($k = 1, 2, \ldots, n$), or $x_k = a + kh$ ($k = 0, \ldots, n$)?
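For concreteness, the two candidate methods can be written as short Python functions (a sketch; the function names are mine):

```python
def grid_repeated_add(a, b, n):
    """Method 1: x_0 = a, x_k = x_{k-1} + h."""
    h = (b - a) / n
    xs = [a]
    for _ in range(n):
        xs.append(xs[-1] + h)  # each point built from the previous one
    return xs

def grid_direct(a, b, n):
    """Method 2: x_k = a + k*h."""
    h = (b - a) / n
    return [a + k * h for k in range(n + 1)]  # each point computed from exact inputs
```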

My attempt: I think the first one is better, because in the second method the multiplication error from $kh$ may be large, and that error is then amplified by the addition with $a$ due to the relatively large distance between $a$ and $kh$. The first method, on the other hand, only involves addition error, and since the distance between $x_{k-1}$ and $h$ is not as large as that between $a$ and $kh$, I arrive at this conclusion.

Can anyone help review my argument above to see if it's correct? Many thanks for your help.


Accepted answer

Hint
Let's make a really bad floating-point unit that rounds every result to the nearest integer. Try to divide $[0,5]$ into $15$ parts of length $\frac13$, i.e. $16$ points.

Method 1 will give you $x_i = 0$ for all $i$ because $0 + \frac13 = 0.\bar3$ is rounded to $0$ in every iteration.

Method 2 will give you $(0,\frac13,\frac23,1,1\frac13,1\frac23,2,\ldots) \to (0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5)$

The same problem occurs on a smaller scale with a good floating-point unit. In general, it is always worse to feed already-erroneous input into further computations ($x_{i+1} = f(x_i)$) than to compute from exact information ($x_{i+1} = g(a,h,i)$).
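The toy rounding unit above is easy to simulate. A minimal Python sketch (the helper `fl` and the variable names are mine; Python's built-in `round` plays the role of the bad unit):

```python
def fl(x):
    """Toy floating-point unit: round every result to the nearest integer."""
    return round(x)

a, h, n = 0, 1/3, 15  # 16 points on [0, 5], spacing 1/3

# Method 1: repeated addition, rounding after each step
x = a
method1 = [x]
for k in range(1, n + 1):
    x = fl(x + h)        # round(0 + 1/3) = 0, so x is stuck at 0 forever
    method1.append(x)

# Method 2: x_k = a + k*h, rounding each operation separately
method2 = [fl(a + fl(k * h)) for k in range(n + 1)]

print(method1)  # [0, 0, 0, ..., 0]
print(method2)  # [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5]
```

(Python's `round` rounds exact halves to even, but no exact halves occur for multiples of $\frac13$, so this does not affect the result.)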


Now, in general, a floating-point unit (FLU) is specified by tolerances $\epsilon, \delta$ such that $$|[a+b]-a-b| \le \epsilon \ \forall a,b\in\mathbb R\\ |[a\cdot b] - ab| \le \delta \ \forall a,b\in\mathbb R$$ This allows you to make estimates:

Method 1 $$e_i = |[x_{i-1} \pm e_{i-1} + h] - (x_{i-1} + h)| \le e_{i-1} + \epsilon \Rightarrow e_i \le i \epsilon$$ where the worst case occurs when $x_i$ and $h$ are chosen such that the FLU makes its maximum error, the $e_{i-1}$ term is the "carried" error, and $e_0 = 0$. Thus the error grows linearly with the number of knots: $e = \mathcal O(n)$.

Method 2 $$e_k = |[x_0 + [k\cdot h]] - x_0 - kh| \le |[x_0 + kh \pm \delta] - x_0 - kh| \le \epsilon + \delta$$ Here the error is bounded by a constant, even in the worst case: $e = \mathcal O(1)$.

Thus the error in Method 2 is at most a constant, which is much smaller than the linearly growing bound of Method 1 as $n$ grows. Method 2 is therefore provably better.
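The same contrast shows up in IEEE double precision. A hedged Python sketch (I pick $a = 0$, $b = 1$, $n = 10^5$; variable names are mine):

```python
n = 10**5
h = 1 / n  # 1e-5 is not exactly representable in binary floating point

# Method 1: repeated addition -- rounding error can accumulate at every step
x = 0.0
for k in range(1, n + 1):
    x = x + h
err1 = abs(x - 1.0)  # compare with the exact endpoint b = 1

# Method 2: direct formula x_n = a + n*h -- only two rounded operations
err2 = abs(0.0 + n * h - 1.0)

print(err1, err2)  # err1 is orders of magnitude larger than err2
```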


Notation
By $[a+b]$ I refer to the result we get when "asking" our FLU for $a+b$. In an ideal FLU, this only depends on the exact result $a+b$, so $$[\cdot]:\mathbb R \to \mathbb R$$ can be thought of as a "round to nearest representable number" function. The FLU specification then tells us that $$|[x]-x| \le \epsilon$$ or, in terms of the supremum norm, $$\|[\cdot]-\mathrm{Id}\|_\infty \le \epsilon$$ I allow addition and multiplication to have different accuracies ($\epsilon$ and $\delta$ respectively), but when estimating the worst-case error (the best case is always $0$ error), you can think of $[\cdot]$ as a function $$[x] = \cases{x+\epsilon\\x-\epsilon}$$ choosing whichever branch increases the total error. (That is, our FLU delivers the greatest error possible within its specification.)


Numerical Experiment
The following MATLAB code produces a plot of the errors. You'll find errB = 0, while errA already climbs to about $10^{-12}$ for $h = 10^{-5}$.

N = 10.^(1:5);            % numbers of steps to test
errA = 0*N;               % errors of Method 1 (repeated addition)
errB = 0*N;               % errors of Method 2 (direct formula)
for i = 1:length(N)
    n = N(i);
    h = 1/n;              % exact endpoint is x_n = 1
    xA = 0;
    for k = 1:n
        xA = xA + h;      % Method 1: accumulate the step
    end
    xB = n*h;             % Method 2: x_n = a + n*h with a = 0
    errA(i) = abs(xA - 1);
    errB(i) = abs(xB - 1);
end
loglog(N, errA, 'r-', N, errB, 'b-');
Another answer

I believe the second is better. You can try it in Python:

In [1]: 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1
Out[1]: 0.9999999999999999

In [2]: 10*0.1
Out[2]: 1.0