$t$ bits for the fraction of a float $\bar{x}$, then the distance between $\bar{x}$ and an adjacent float is no more than $2^{e-t}$


I'm reading this paper on floating-point numbers, and it states roughly that:

If the fractional part of a floating-point number requires $t$ bits, then the distance between the float $\bar{x} = \pm 0.b_1b_2\ldots b_t \times 2^e$ and an adjacent float is no more than $2^{e-t}$.

Why is that? Why exactly $2^{e-t}$?

Then, it states that

Dividing $2^{e-t}$ by $\bar{x}$ gives an upper bound of $2^{1-t}$ on the relative distance between any two adjacent floats.

I understand that this is a relative distance, because we're dividing by $\bar{x}$, but how exactly is $2^{1-t}$ obtained?


The distance is exactly $2^{e-t}$ (in almost all cases) because if $\bar y$ is a floating-point number adjacent to $\bar x$ then $$ \lvert \bar x-\bar y \rvert = 0.\overbrace{00\ldots00}^{t-1\text{ zeros}}1_2\times 2^e = 2^{-t} \times 2^e = 2^{e-t}. $$ In other words, when you only have $t$ bits of precision, the smallest change you can make in the value of the number is to add or subtract $1$ in the last bit position.

I say this is true "in almost all cases" because there is one class of numbers, namely the ones written $\pm 0.1_2 \times 2^e$, for which the floating-point number on one side is closer than the floating-point number on the other side. Specifically, \begin{align} 0.1_2\times2^e + 2^{e-t} &= 0.1_2\times2^e + 0.\overbrace{00\ldots00}^{t-1\text{ zeros}}1_2 \times 2^e\\ &= 0.1\overbrace{00\ldots00}^{t-2 \text{ zeros}}1_2 \times 2^e \\ 0.1_2\times2^e - 2^{e-t-1} &= 0.1_2\times2^e - 0.\overbrace{00\ldots00}^{t-1\text{ zeros}}1_2 \times 2^{e-1}\\ &= 1 \times 2^{e-1} - 0.\overbrace{00\ldots00}^{t-1\text{ zeros}}1_2 \times 2^{e-1}\\ &= 0.{\overbrace{11\ldots11}^{t \text{ ones}}}_2 \times 2^{e-1} . \end{align}

So the distance between $0.1_2\times2^e$ and the adjacent floating-point number on one side is $2^{e-t}$, but the distance to the adjacent floating-point number on the other side is $2^{e-t-1}$. But notice that $2^{e-t-1} < 2^{e-t}$, so it is still true that the distance is no greater than $2^{e-t}$.

When someone says "no greater than", they do not mean "exactly equal to". The number $2^{e-t}$ is merely an upper bound that just happens to be exactly met in most cases.
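You can check both gaps concretely in Python (a quick illustration, not part of the paper's argument). IEEE-754 doubles have $t = 53$ fraction bits in the paper's $0.b_1b_2\ldots b_t \times 2^e$ convention, and `math.nextafter` gives the adjacent floats directly. Taking $\bar x = 1.0 = 0.1_2 \times 2^1$ (so $e = 1$) exhibits the gap $2^{e-t}$ on one side and the smaller gap $2^{e-t-1}$ on the other:

```python
import math

# IEEE-754 double: t = 53 fraction bits in the 0.b1...bt × 2^e convention
t, e = 53, 1
x = 1.0  # = 0.1_2 × 2^1, the special boundary case discussed above

gap_up = math.nextafter(x, math.inf) - x      # distance to next larger float
gap_down = x - math.nextafter(x, -math.inf)   # distance to next smaller float

print(gap_up == 2.0 ** (e - t))        # True: gap above is 2^(e-t) = 2^-52
print(gap_down == 2.0 ** (e - t - 1))  # True: gap below is 2^(e-t-1) = 2^-53
```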

We then have that the relative distance from $\bar x$ to an adjacent floating-point number is at most $$ \left\lvert \frac{2^{e-t}}{\bar x} \right\rvert = \frac{2^{e-t}}{\lvert \bar x\rvert}. $$ We're looking for an upper bound on this expression, so we want to choose an $\bar x$ that makes the expression as large as it can possibly be. For any given numerator, the largest possible value of a fraction occurs when we make the denominator as small as we can. Assuming that $\bar x$ is not a denormal number, the smallest its absolute value can be (while still having exponent $e$) is $0.1_2 \times 2^e$. (The next smaller floating-point number has exponent $e-1$). So for numbers with exponent $e$, the largest possible relative error is $$ \frac{2^{e-t}}{0.1_2 \times 2^e} = \frac{2^{e-t}}{2^{e-1}} = 2^{(e-t)-(e-1)} = 2^{1-t}. $$

We can also see that this upper bound is reached at the relative distance between $\bar x = 0.1_2 \times 2^e$ and the next larger floating-point number, $\bar x + 2^{e-t}$, so this is the least upper bound on the error.
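This least upper bound is a familiar quantity in practice: for doubles ($t = 53$ in this convention) it is $2^{1-53} = 2^{-52}$, which is exactly what Python reports as `sys.float_info.epsilon`. A sketch of the check, using $\bar x = 0.5 = 0.1_2 \times 2^0$, the smallest magnitude with exponent $e = 0$:

```python
import math
import sys

t = 53   # fraction bits of a double in the 0.b1...bt × 2^e convention
x = 0.5  # = 0.1_2 × 2^0: smallest magnitude with exponent e = 0

rel = (math.nextafter(x, math.inf) - x) / x  # relative gap to the next float up

print(rel == 2.0 ** (1 - t))          # True: the bound 2^(1-t) is attained
print(rel == sys.float_info.epsilon)  # True: this is the machine epsilon
```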

Here's an example using literal numbers instead of the abstract symbols $e$ and $t$. Suppose we are using floating-point numbers with $11$ bits of precision, that is, $t=11$, and a five-bit exponent, in which all numbers are normalized. The exponent is biased in such a way that $e=7$, $e=8$, $e=-2$, and $e=-3$ are all possible exponent values.

Consider the number $0.1_2 \times 2^8$. This number can be represented exactly in this floating-point number system, and $0.1_2 \times 2^8 = 2^7 = 128.$ The next larger exact floating-point number is $0.10000000001_2 \times 2^8 = 2^7 + 2^{-3} = 128.125.$ The next smaller exact floating-point number is $0.11111111111_2 \times 2^7 = 2^7 - 2^{-4} = 127.9375.$ Of these two numbers, $0.10000000001_2 \times 2^8$ is farther from $0.1_2 \times 2^8$ and produces the larger relative error. That relative error is $$ \left\lvert\frac{0.10000000001_2 \times 2^8 - 0.1_2 \times 2^8} {0.1_2 \times 2^8}\right\rvert = \frac{0.00000000001_2 \times 2^8}{0.1_2 \times 2^8} = \frac{2^{-3}}{2^7} = 2^{1-11}. $$

In this floating-point number system, the number $0.00000000001_2 \times 2^8$ is stored in the format $0.1_2 \times 2^{-2}.$ The next larger exact number is $0.10000000001_2 \times 2^{-2}$ and the next smaller exact floating-point number is $0.11111111111_2 \times 2^{-3}.$ The error relative to $0.00000000001_2 \times 2^8$ therefore is $$ \left\lvert\frac{0.10000000001_2 \times 2^{-2} - 0.1_2 \times 2^{-2}} {0.1_2 \times 2^{-2}}\right\rvert = \frac{0.00000000001_2 \times 2^{-2}}{0.1_2 \times 2^{-2}} = \frac{2^{-13}}{2^{-3}} = 2^{1-11}. $$

In other words, we get the exact same relative error, which happens to be the maximum relative error possible for this floating-point system. The reason it works this way is the condition in the definition of this number system that "all numbers are normalized." That means we do not allow numbers to be stored with an arbitrary number of leading zeros; instead, the most significant bit is always stored in the same place in the mantissa.
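The two worked examples can be reproduced with exact rational arithmetic (a sketch in Python using `fractions.Fraction`; the helper name `rel_gap_up` is mine, not from the answer):

```python
from fractions import Fraction

T = 11  # fraction bits of the toy system

def rel_gap_up(frac_bits, e):
    """Relative distance from 0.b1...b11 × 2^e to the next larger float,
    where frac_bits is the integer whose binary digits are b1 b2 ... b11."""
    x = Fraction(frac_bits, 2 ** T) * Fraction(2) ** e   # exact value of the float
    gap = Fraction(1, 2 ** T) * Fraction(2) ** e         # 2^(e-T), the gap upward
    return gap / x

half = 2 ** (T - 1)  # fraction bits 10000000000, i.e. the pattern 0.1_2

# 0.1_2 × 2^8 = 128 and 0.1_2 × 2^-2 = 1/8 have the same relative gap, 2^(1-11)
print(rel_gap_up(half, 8) == Fraction(1, 2 ** (T - 1)))   # True
print(rel_gap_up(half, -2) == Fraction(1, 2 ** (T - 1)))  # True
```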

The IEEE-754 floating-point number system allows some numbers to be stored with extra leading zeros. These are called denormal (or subnormal) numbers, and they are allowed to occur only for the values closest to zero. (In single-precision IEEE-754 format, the denormal numbers all have absolute values less than $2^{-126}$.) The upper bound $2^{1-t}$ is not valid for denormal numbers $\bar x$; for example, the second-smallest denormal is twice as large as the smallest denormal, a relative distance of $1$. But the paper linked in the question rules out denormals by stating that the first bit after the "decimal" point is always $1$.
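You can watch the bound fail for denormals in actual IEEE-754 doubles (a Python illustration; `math.ulp(0.0)` returns the smallest positive subnormal, $2^{-1074}$):

```python
import math
import sys

tiny = math.ulp(0.0)  # smallest positive subnormal double, 2^-1074
rel = (math.nextafter(tiny, math.inf) - tiny) / tiny  # relative gap upward

print(rel)                           # 1.0: the neighbor is twice as large
print(rel > sys.float_info.epsilon)  # True: far above the normal bound 2^(1-t)
```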