Explanation for MATLAB floating point number calculation?

I am a beginner studying scientific computing, more specifically floating-point numbers and precision in MATLAB. When testing the outputs of the two anonymous functions below, I am not sure how MATLAB computed the results or why they come out this way.

    y = @(x) (exp(x)-1-x)/x^2; 
    z = @(x) (exp(x)-x-1)/x^2; 
    x = 1e-10; 
    fprintf('x=%.5e\n  y=%.5e\n  z=%.5e\n', x, y(x), z(x))

Result:
x=1.00000e-10
y=8.27404e+02
z=0.00000e+00

As you can see, the only difference between y and z is that the order of the subtractions in the numerator changed from $\exp(x)-1-x$ to $\exp(x)-x-1$. Is the fact that floating-point subtraction is not associative the reason the results of y and z differ?
Why is z's result exactly 0 rather than some small nonzero number? Is it because MATLAB's double-precision arithmetic rounded it to 0?

Any clarification would be appreciated.

There are 2 answers below.


$x=10^{-10}$ has a "filled" mantissa in the binary floating-point encoding: since it is not a power of $2$ (or a short sum of powers of $2$), all bits of the significand are needed to approximate it.

As $x^2=10^{-20}$ is zero relative to $1$ in floating-point precision, $\exp(x)$ evaluates to $1+x$ rounded to floating-point precision (or at most 1 ulp, one unit in the last place, away from it).

$\exp(x)-1$ thus returns a truncated version of $x$, essentially its leading 5 or 6 significant digits. Consequently $(\exp(x)-1)-x$ returns the small remainder left over by this truncation.

In $(\exp(x)-x)-1$ the floating-point errors are undone in the reverse of the order in which they were introduced, so the result is zero (remember that $\exp(x)=1+x$ at this precision).
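The effect described above is easy to reproduce outside MATLAB. Here is a sketch in Python (same $x = 10^{-10}$ as in the question; the printed values assume an IEEE-754 double and a well-rounded exp, which mainstream platforms provide):

```python
import math

x = 1e-10

# Same two evaluation orders as in the question.
y = (math.exp(x) - 1 - x) / x**2   # subtract 1 first, then x
z = (math.exp(x) - x - 1) / x**2   # subtract x first, then 1

print(f"y = {y:.5e}")  # a large garbage value, about 8.27e+02
print(f"z = {z:.5e}")  # exactly zero: exp(x) - x rounds back to 1.0
```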


First let's see what the answer should have been if all calculations were performed on exact real numbers (that is, with infinite precision and no roundoff):

\begin{align}
\exp(x) &= 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots, \\
\exp(x) - 1 - x &= \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots, \\
\frac{\exp(x)-1-x}{x^2} &= \frac{1}{2!} + \frac{x}{3!} + \frac{x^2}{4!} + \cdots \\
&= \frac12 + \frac16 x + \frac1{24} x^2 + \cdots.
\end{align}

So for very small $x$ we would get a result close to $\frac12,$ unlike either of the calculations MATLAB performed for you.
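As a numerical check on this limit (not part of the original answer): Python's math.expm1 computes $\exp(x)-1$ accurately for tiny arguments, so the series value of roughly $\frac12$ survives the subtraction:

```python
import math

x = 1e-10

# expm1(x) evaluates exp(x) - 1 without the catastrophic cancellation,
# so the series sum x^2/2! + x^3/3! + ... is preserved in the numerator.
good = (math.expm1(x) - x) / x**2
print(good)  # very close to 0.5, as the Taylor series predicts
```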

Working through what is apparently happening, we start with $\exp(x),$ which is very close to $1 + x$ for small $x$. Given the limited precision of a double-precision number, for small enough $x$ the remaining terms of the series ($\frac12 x^2$ and smaller terms) are unlikely to make any difference in what number finally is stored. But what is more important, the double-precision number is not even capable of representing $1 + x$ very well.

The number $x = 10^{-10}$ is not exactly representable in double precision, and the computer ends up using all the bits of precision it has in order to represent it as closely as possible. But when we add $x$ to $1,$ we have to shift the mantissa of $x$ to the right in order to add it to the mantissa of $1,$ and when we put the result back into a double-precision number we lose several bits of precision.

Likewise when we put $\exp(x)$ into a double-precision number, where $x = 10^{-10}$, we lose much of the precision of the $x$ term of the series and essentially all of the precision of the remaining terms. The result is $1 + x + \delta = 1.0000000001 + \delta,$ where the roundoff error $\delta$ is something on the order of $\pm 10^{-16}.$
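The roundoff term $\delta$ can be inspected exactly by converting the stored double to Decimal (a Decimal constructed from a float is exact; the size of $\delta$ below assumes a well-rounded platform exp):

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50  # plenty of digits to see the error exactly

# delta = fl(exp(1e-10)) - (1 + 10^-10), computed in exact decimal arithmetic.
delta = Decimal(math.exp(1e-10)) - (Decimal(1) + Decimal("1e-10"))
print(delta)  # tiny but nonzero, far below 1e-15 in magnitude
```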


So what happens when we try to compute $y$?

We subtract $1,$ and end up with $10^{-10} + \delta,$ where $\delta$ is still on the order of $\pm 10^{-16}.$ That is, instead of the approximately $17$ decimal digits of precision we enjoy in double-precision floating-point numbers, we may have as few as $7$ digits of precision.

Now we subtract $x$, which is $10^{-10}$ but is represented only approximately in double precision. Still, the error of the representation of $x$ is at most something on the order of $10^{-26},$ which is very small compared to the error we already have in the representation of $\exp(x) - 1.$ So after subtracting $x$ we have $\delta'$, where $\delta'$ is either $\delta$ or something very close to $\delta,$ possibly something on the order of $\pm 10^{-16}.$
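The claim about the size of the representation error in $x$ can also be checked exactly with Decimal:

```python
from decimal import Decimal, getcontext

getcontext().prec = 50

# Exact difference between the stored double 1e-10 and the true value 10^-10.
repr_err = Decimal(1e-10) - Decimal("1e-10")
print(repr_err)  # magnitude below about 1e-26, i.e. under one ulp at 1e-10
```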

Finally we divide by $x^2,$ that is, we multiply by $10^{20}$ (approximately, since $x$ was not represented exactly in the computer). We end up with something possibly on the order of $\pm 10^{-16} \times 10^{20} = \pm 10000.$ And (no surprise) what we actually get is $827.404.$ This is actually more "accurate" than we have any right to expect; if you do the same calculation in Python but set $x$ to 1.01e-10 instead of 1e-10, the result is $9525.989554036925.$
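That Python experiment is easy to repeat. The exact garbage value depends on the platform's exp (the answer reports about $9.526 \times 10^3$), so the only robust check is that the result is nowhere near the true limit of $\frac12$:

```python
import math

x = 1.01e-10

# Same y formula as in the question, at a slightly different x.
y2 = (math.exp(x) - 1 - x) / x**2
print(y2)  # a large value, wildly different from the true limit of 0.5
```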


Now let's try to compute $z$.

We subtract $x$ from $\exp(x)$ and end up with $1 + \delta'',$ where $\delta''$ represents whatever bits of $x$ were lost in the roundoff; it is on the order of $\pm 10^{-16}$ or less. Depending on what those bits are, $\delta''$ could be $\epsilon$ (the "machine epsilon" of a double-precision number), $-\epsilon,$ or $0.$

In this case it appears that $\delta'' = 0.$ So the result of exp(x) - x is exactly $1.$ Now we subtract $1$ and we have $0$ exactly.

Divide $0$ by a positive number and you still have $0,$ which is the final value of $z.$

For different values of $x$ you can find that $\exp(x) - x$ does not round off to $1$ exactly, but rather to $1 \pm \epsilon.$ Then you get a value of $z$ that is not zero, but not much more accurate than the value of $y$ you got. In fact, for x = 1.04e-10 in Python I find that z comes out to $-10264.635952525487.$
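This final observation, z landing on a multiple of $\epsilon$ instead of $0$, can be demonstrated the same way (the exact value, about $-1.03 \times 10^4$ in the answer's run, again depends on the platform's exp, so the check below is deliberately loose):

```python
import math

x = 1.04e-10

# Same z formula as in the question: here exp(x) - x rounds to a neighbor
# of 1 rather than to 1 exactly, and dividing by x^2 blows the ulp up.
z2 = (math.exp(x) - x - 1) / x**2
print(z2)  # nonzero, on the order of 1e4 in magnitude
```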