How can I add the following 32-bit IEEE floating-point numbers?

How can I add the following two 32-bit IEEE floating-point numbers in binary?

FEDCBA98(base 16) + 89ABCDEF(base 16)

= a 33-bit binary number.

How can this be possible?

Best answer:

First, let's work out the binary representations:

\begin{align} x = \mathrm{FEDCBA98}_{16} &= 1111\, 1110 \, 1101 \, 1100 \, 1011 \, 1010 \, 1001 \, 1000_{2} \\ y = \mathrm{89ABCDEF}_{16} &= 1000\, 1001 \, 1010 \, 1011 \, 1100 \, 1101 \, 1110 \, 1111_{2} \end{align}

Then let's interpret these bit patterns as IEEE 32-bit floating point: 1 sign bit, then 8 exponent bits stored with a bias of 127, then 23 fraction bits. Both have the sign bit set, so both are negative. The exponent field of $x$ is $e_{x} + 127 = 11111101_{2} = 253$, so $e_{x} = 126$. The exponent field of $y$ is $e_{y} + 127 = 00010011_{2} = 19$, so $e_{y} = -108$.

The significands are

\begin{align} 1.b^{x}_{22}b^{x}_{21} \cdots &= 1.10111001011101010011000_{2} \\ 1.b^{y}_{22}b^{y}_{21} \cdots &= 1.01010111100110111101111_{2} \end{align}

Scaling each significand by $2^{e}$ shifts the bits of $x$ to the left by 126 places and the bits of $y$ to the right by 108 places, so their leading bits end up $126 + 108 = 234$ positions apart. A 32-bit float keeps only 24 significand bits, so every bit of $y$ lies far below the last bit of $x$ and is lost when the sum is rounded, i.e., $x+y = x$. You can see this with the following code:

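#include <cstdint>
#include <iostream>

// Overlay a float and a 32-bit integer so the raw bit pattern can be set
// and inspected. Reading the member that was not last written is formally
// undefined in C++, although GCC documents it as supported; std::memcpy
// is the strictly portable alternative.
union bits
{
  float x;
  std::uint32_t i;  // unsigned, so 0xfedcba98 fits without narrowing
};

int main()
{
  bits b1, b2, b3;

  b1.i = 0xfedcba98;
  b2.i = 0x89abcdef;

  std::cout << b1.x << " + " << b2.x << " = " << b1.x+b2.x << std::endl;

  b3.x = b1.x + b2.x;

  // Identical bit patterns mean the sum is exactly b1.x.
  if(b3.i == b1.i)
  {
    std::cout << "b2 got truncated away!" << std::endl;
  }
}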