Computing number of significant digits

76 Views Asked by At

Consider the problem of computing the volume of a hollow sphere via the formula

$$ V(w) = \frac{4}{3} \pi ((r + w)^3 - r^3)$$

where $r > 0$ is the (inner) radius of the sphere and $w \geq 0$ is the width of the wall of the sphere.

Now consider the case $r = 1$ and $w = 10^{-15}$. Without modifying the formula, MATLAB (with double precision, so more than 15 significant digits) returns the value

$$ V = 1.396147399203453 ~~e-14 ~~.$$

(At least) how many of the digits of this result are correct?

Attempt:

Computing the term in the brackets yields

$$(1 + 10^{-15})^3 - 1^3 = 3 \cdot 10^{-15} + 3 \cdot 10^{-30} + 10^{-45} = 3 + 3 \cdot 10^{-15} + 10^{-30} ~~e-15$$

Since we are guaranteed at least 15 significant digits, we can deduce that the first 15 digits of the middle expression above is correct (but not necessarily the digits after that i.e. the ones coming from $3 \cdot 10^{-30} + 10^{-45}$). That is, we cannot guarantee that the 15th digit in the last (right-most) expression is correct.

Multiplying by $\frac{4}{3} \pi$ gives

$$ 1.396147399203453 ~~e-14 + 1.396147399203453 ~~e-29 + 1.396147399203453 ~~e-44 $$

Since we could only guarantee that the first 14 digits in the last expression were correct, we can only guarantee it for the first 14 digits here, which corresponds to $1.39614739920345$.


  1. Is the result correct? and is the reasoning correct?
  2. In any case I feel like the argument is very akward. I would appreciate some comments for improvement.

Thank you in advance.

1

There are 1 best solutions below

10
On BEST ANSWER

Since you've already discovered that only 1 digit was correct, let me point out where you went wrong. While there is a bit of error coming into the calculation of $\frac 43\pi\approx 4.188790204786391$, it is negligible compared to the massive cancellation error.

Because double precision has 52 bits to represent the numeric portion, an arbitary number is not stored exactly. Let $x = 1 + 10^{-15}$. Note that this is the number $1.000000000000001$. Let $x_1$ be the same value as stored in a double precision format. That is $$x_1 = 1.\underbrace{00000000000000000000000000000000000000000000000000}_{50}10_b\times 2^{0}$$ Since there are only $52$ bits to the right of the radix in double precision, this is the closest value to $x$ that can be stored (note that I am leaving the exponent portion in decimal because nothing special is going on there - the action is in the mantissa). This number is $$x_1 = 1 + 2^{-51} \approx 1.0000000000000004441$$ and therefore, $$x - x_1 \approx 5.559 \times 10^{-16}$$ Let $\epsilon = 10^{15}(x - x_1) \approx 0.5559$, so $x_1 = x - \epsilon\times 10^{-15} = 1 + (1-\epsilon)\times 10^{-15}$.

The next step is to cube: $$x_1^3 = 1^3 + 3\cdot 1^2(1-\epsilon)10^{-15} +O(10^{-30})$$ This result is once again stored in a double-precision register. Call the result $x_2$: $$x_2 = 1.\underbrace{0000000000000000000000000000000000000000000000000}_{49}110_b\times2^{0}$$

Finally, we subtract $1^3 = 1$ to get $x_3$: $$\begin{align}x_3 &= 0.\underbrace{0000000000000000000000000000000000000000000000000}_{49}110_b\times2^{0}\\&={1.10\underbrace{00000000000000000000000000000000000000000000000000_b}_{50}}\times2^{-50}\end{align}$$ Where the latter expression is what is stored. Note that only the first three bits came from the value stored for $x_2$. The other 50 zero bits are not significant. The correct value for these bits were lost because they couldn't be stored in $x_1$ or $x_2$. It is only because $x_3$ got rid of the leading $1$ that suddenly we could shift the entire thing left by 50 bits and change the exponent. This is what cancellation error does. We knew $x_2$ to a precision of 52 bits. We know $1$ to infinite precision, but we only know $x_3 = x_2 - 1$ to a precision of 3 bits. All the rest are unknown bits set to $0$ since they have to be something.

In the alternate approach where one notes that $(r+w)^3 - r^3 = 3r^2w + 3rw^2 + w^3 = 3\times 10^{-15} + 3\times 10^{-30} + 10^{-45}$, the first term is stored as $$1.1011000001011000011101101110010110110000000100100000_b \times 2^{-50}$$ and aligning the second term with that to allow addition is (if I haven't messed up) $$0.0000000000000000000000000000000000000000000000000111_b \times 2^{-50}$$ The final term is too small to contibute anything at all, so the computed value will be $$1.1011000001011000011101101110010110110000000100100111_b \times 2^{-50}$$ Which is indeed the first 53 significant bits of $(1+10^{-15})^3 - 1$, and considerably different from the $x_3$ value computed before.