$a + b = a$ in machine precision

98 Views Asked by At

I have the following statement:

"If $a + b = a$, then $b = 0$" may not true with the floating point operations. Actually, if $|y| ‎< (\varepsilon / B) |x|$, then $fl(x+y) = x$, where $\varepsilon$ is the machine precision, $B$ is the basis of the number system, and $fl(x+y)$ denotes the value of the expression as obtained by the floating-point arithmetic.

How can I prove this?

1

There are 1 best solutions below

0
On BEST ANSWER

Let's concentrate on binary numbers, i.e. $B=2$. Suppose you have floating point numbers with 6 digits mantissa, and implied one before the decimal point. So for example, $x$ might look like this:

$$ x = 1.\underbrace{001001}_{\text{6 digits}}{}_2 \times 2^4 $$

In this case, the unit in the last place (ULP) is

$$ \operatorname{ULP}(x) = 0.000001_2 \times 2^4 = 2^{4-6} = 2^{-2} $$

This measures the distance to the nearest different floating point number. If you round to nearest, you'll have to add at least half that amount to obtain a number different from $x$. In other words:

$$ x\neq\operatorname{fl}(x+y) \quad\Leftrightarrow\quad \lvert y\rvert\ge\frac{\operatorname{ULP}(x)}{2}$$

Now you'll have to translate concepts. Due to the size of the mantiassa, you have $\epsilon=2^{-6}$, which is the first factor in my ULP computation. Furthermore, the mantissa of $x$ will be between $1$ (inclusive) and $2$ (exclusive), since it is one point something.

$$ 2^4 = 1\times 2^4 \le \lvert x \rvert \lt 2\times 2^4 = 2^5 \\ \operatorname{ULP}(x) = 2^{-2} \le \epsilon\,\lvert x\rvert < 2^{-1} = 2\,\operatorname{ULP}(x) $$

So this term $\epsilon\,\lvert x\rvert$ in your question is on the same order as $\operatorname{ULP}(x)$, but it can be almost twice as large.

For this reason, I do not agree with the statement as you phrased it.

Counterexample:

\begin{align*} x &= 1.001001_2\times 2^4 \\ \epsilon &= 0.000001_2 = 2^{-6} \\ \frac{\epsilon}{2}\,\lvert x\rvert &= 1.001001_2 \times 2^{-3} \\ \frac{\operatorname{ULP}(x)}{2} &= 1.000000_2 \times 2^{-3} = 2^{-3} \\ y &= 1.001000_2 \times 2^{-3} \\ x+y &= 1.0010011001 \times 2^4 \\ \operatorname{fl}(x+y) &= 1.001010 \times 2^4 > x \\ \end{align*}

To correct the statement, I'd divide $\epsilon$ not only by $B$ but by $2B$. This is because in general, $\epsilon\,\lvert x\rvert< B\operatorname{ULP}(x)$, so dividing by $B$ results in a bound at least as tight as ULP, and dividing by two will take care of the round-to-nearest behavior where you may only differ by half an ulp without changing the value.