Exact representation of floating point numbers


Why do 1000.5, 1/16 and 1.5/32 have an exact representation in an arbitrary (finite) normalized binary floating point number system, but 123.4, 0.025 and 1/10 don't? How can this easily be seen without constructing the complete floating point representation?


BEST ANSWER

Written as a fraction in lowest terms, a number has a finite binary representation exactly when its denominator is a power of $2$.

So

  • $1000.5 = \dfrac{2001}{2^1}$,
  • $1/16=\dfrac{1}{2^4}$ and
  • $1.5/32=\dfrac{3}{2^6}$

while

  • $123.4= \dfrac{617}{2^0\times 5}$,
  • $0.025= \dfrac{1}{2^3\times 5}$ and
  • $1/10= \dfrac{1}{2^1\times 5}$

all having non-powers of $2$ in the denominator.

By comparison, for a decimal fraction to have a finite representation, the denominator of its lowest-terms fraction must be a power of $2$ times a power of $5$, since the prime factorisation of $10$ is $2 \times 5$.
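The lowest-terms test above is easy to automate. As a sketch using Python's `fractions.Fraction` (which always stores fractions in lowest terms), with a hypothetical helper name `has_finite_binary_expansion`:

```python
from fractions import Fraction

def has_finite_binary_expansion(q: Fraction) -> bool:
    """A reduced fraction terminates in base 2 iff its denominator
    is a power of 2 (Fraction always reduces to lowest terms)."""
    d = q.denominator
    return d & (d - 1) == 0  # bit trick: true iff d is a power of two

# The examples from the answer:
print(has_finite_binary_expansion(Fraction(2001, 2)))  # 1000.5  -> True
print(has_finite_binary_expansion(Fraction(1, 16)))    # 1/16    -> True
print(has_finite_binary_expansion(Fraction(3, 64)))    # 1.5/32  -> True
print(has_finite_binary_expansion(Fraction(617, 5)))   # 123.4   -> False
print(has_finite_binary_expansion(Fraction(1, 40)))    # 0.025   -> False
print(has_finite_binary_expansion(Fraction(1, 10)))    # 1/10    -> False
```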

ANSWER

The numbers that can be represented with a finite binary floating point representation are called the dyadic rationals. They comprise all numbers that can be represented in the form $\frac{i}{2^j}$ where $i$ and $j$ are integers and $j \ge 0$. $123.4 = \frac{1234}{100} = \frac{617}{50}$ etc. cannot be represented in this form.
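One way to see this concretely: in Python, `Fraction(float)` recovers the exact dyadic rational a float actually stores, while `Fraction("...")` parses the intended decimal value. The two agree only for dyadic rationals (a sketch, not a definitive test, since it assumes CPython doubles):

```python
from fractions import Fraction

# 2001/2 is dyadic, so the float stores it exactly:
print(Fraction(1000.5) == Fraction("1000.5"))  # True

# 1/10 is not dyadic, so the float stores a nearby dyadic rational:
print(Fraction(0.1) == Fraction("0.1"))        # False
print(Fraction(0.1).denominator)               # a large power of 2
```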

ANSWER

Another way of stating what the other answers have already said is that the numbers with an exact floating point representation are exactly those with a terminating expansion in base $2$.

So, $1.5 = 1.1_2$, and $1.875 = 1.111_2$, but $1/10 = 0.00011001100110011 ... _2$.
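These expansions can be generated by long division in base $2$: repeatedly double the remainder and read off a bit. A minimal sketch (the function name `binary_fraction_digits` is my own):

```python
def binary_fraction_digits(num, den, max_digits=20):
    """Base-2 long division on the fractional part of num/den.
    Returns (list of bits, whether the expansion terminated)."""
    digits = []
    r = num % den              # fractional-part remainder
    for _ in range(max_digits):
        if r == 0:
            return digits, True
        r *= 2                 # shift one binary place
        digits.append(r // den)
        r %= den
    return digits, False       # still nonzero remainder: repeating

print(binary_fraction_digits(3, 2))    # 1.5   -> ([1], True), i.e. 1.1_2
print(binary_fraction_digits(15, 8))   # 1.875 -> ([1, 1, 1], True), i.e. 1.111_2
print(binary_fraction_digits(1, 10))   # 1/10  -> (0001 1001 1001 ..., False)
```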

ANSWER

Besides the number being a dyadic rational, its binary representation must use no more bits than the floating point mantissa provides. $1+2^{-64}$ has a denominator that is a power of $2$, but it requires $65$ bits to represent. Unless you are using words longer than $64$ bits, this will be rounded to exactly $1$. All your examples have relatively short binary representations, the longest being $1000.5_{10}=1111101000.1_2$, which requires $11$ bits. The old $32$-bit floating point format could store about $24$ bits of mantissa; the standard $64$-bit format stores $53$ bits.
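The $53$-bit limit is easy to observe in any language with IEEE 754 doubles. A quick check in Python, assuming CPython's standard float type:

```python
import sys

# Bits of mantissa (significand) in a Python float (an IEEE 754 double):
print(sys.float_info.mant_dig)    # 53

# 1 + 2^-52 still fits in 53 significant bits, so it is distinct from 1.0:
print(1.0 + 2.0**-52 == 1.0)      # False

# 1 + 2^-64 would need 65 significant bits, so it rounds back to 1.0:
print(1.0 + 2.0**-64 == 1.0)      # True
```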