How can I add the following two 32-bit IEEE floating-point numbers in binary?
FEDCBA98(base 16) + 89ABCDEF(base 16)
= a 33-bit binary number.
How can this be possible?
First, let's work out the binary representations: \begin{align} x = \mathrm{FEDCBA98}_{16} &= 1111\, 1110 \, 1101 \, 1100 \, 1011 \, 1010 \, 1001 \, 1000_{2} \\ y = \mathrm{89ABCDEF}_{16} &= 1000\, 1001 \, 1010 \, 1011 \, 1100 \, 1101 \, 1110 \, 1111_{2} \end{align} Next, interpret these bit patterns as IEEE 754 single-precision numbers (1 sign bit, 8 exponent bits, 23 fraction bits). Both have the sign bit set, so both are negative. The biased exponent of $x$ is $e_{x} + 127 = 11111101_{2} = 253$, so $e_{x} = 126$. The biased exponent of $y$ is $e_{y} + 127 = 00010011_{2} = 19$, so $e_{y} = -108$.
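The field extraction above can be checked with a few lines of Python. This is only a sketch: the helper name `decode_fields` is made up for illustration, and it assumes the normal single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits, bias 127).

```python
def decode_fields(hex_word):
    """Split a 32-bit pattern into (sign, biased exponent, fraction).

    Hypothetical helper; assumes the IEEE 754 single-precision layout.
    """
    bits = int(hex_word, 16)
    sign = bits >> 31                      # top bit
    biased_exponent = (bits >> 23) & 0xFF  # next 8 bits
    fraction = bits & 0x7FFFFF             # low 23 bits
    return sign, biased_exponent, fraction


for word in ("FEDCBA98", "89ABCDEF"):
    sign, biased, fraction = decode_fields(word)
    print(word, "sign =", sign, "biased exponent =", biased,
          "unbiased exponent =", biased - 127,
          "fraction =", format(fraction, "023b"))
```

Both words should report a sign bit of 1, with biased exponents 253 and 19 respectively.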
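The significands are \begin{align} 1.b^{x}_{22}b^{x}_{21} \cdots &= 1.10111001011101010011000_{2} \\ 1.b^{y}_{22}b^{y}_{21} \cdots &= 1.01010111100110111101111_{2} \end{align} Multiplying by $2^{e}$ shifts the bits of $x$ left by 126 places and the bits of $y$ right by 108 places, so the leading bits of the two numbers are 234 positions apart. A single-precision significand holds only 24 bits, so every bit of $y$ lies far below the last bit of $x$ and is rounded away in the sum, i.e., $x + y = x$. The floating-point result is therefore again the 32-bit pattern $\mathrm{FEDCBA98}_{16}$; a 33-bit result appears only if the two patterns are added as unsigned integers. You can see this with the following code: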
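A minimal sketch in Python, using only the standard `struct` module. The helper names `bits_to_float` and `float_to_bits` are illustrative, not from any library. Python floats are doubles, so the sum is computed in double precision and then rounded back to single precision when packed. With operands this far apart in magnitude, that gives the same result as a native single-precision addition.

```python
import struct


def bits_to_float(hex_word):
    """Reinterpret a 32-bit hex pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", bytes.fromhex(hex_word))[0]


def float_to_bits(value):
    """Round a Python float to single precision and return its hex pattern."""
    return struct.pack(">f", value).hex().upper()


x = bits_to_float("FEDCBA98")
y = bits_to_float("89ABCDEF")
total = x + y

print(float_to_bits(total))  # FEDCBA98
print(total == x)            # True

# Adding the same patterns as unsigned integers is a different operation,
# and this is where a 33-bit result comes from.
integer_sum = 0xFEDCBA98 + 0x89ABCDEF
print(hex(integer_sum), integer_sum.bit_length())  # 0x188888887 33
```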