Floating point subtraction for binary numbers

Suppose I want to perform the following floating-point subtraction in binary: 0.35 - 0.62

I can work through to the end, but I cannot figure out how the sign bit is determined.

1) First, we write the numbers in binary. Assume that we can represent 4 digits in the fraction part:

0.35 -> 0.010110...   -> 1.0110 * 2^(-2)
0.62 -> 0.1001111.... -> 1.0011 * 2^(-1)
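
As a quick check of step 1, here is a minimal Python sketch. It assumes the extra bits are simply truncated (not rounded), and the helper name `to_binary_float` is hypothetical, introduced only for illustration.

```python
import math

def to_binary_float(x, frac_bits=4):
    """Return (mantissa_string, exponent) for x > 0.

    Assumption: bits beyond frac_bits are truncated, not rounded.
    """
    # frexp gives x = m * 2**e with 0.5 <= m < 1, so the normalized
    # form 1.xxxx * 2**(e - 1) has mantissa 2 * m.
    m, e = math.frexp(x)
    scaled = int(m * 2 ** (frac_bits + 1))      # integer holding 1 + frac_bits bits
    bits = format(scaled, f"0{frac_bits + 1}b")
    return f"{bits[0]}.{bits[1:]}", e - 1

print(to_binary_float(0.35))  # ('1.0110', -2)
print(to_binary_float(0.62))  # ('1.0011', -1)
```

Note that truncation means the values actually being subtracted are 0.34375 and 0.59375, not 0.35 and 0.62.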

2) We have to modify the smaller exponent to make it equal to the larger one. So:

1.0110 * 2^(-2)   =   0.1011 * 2^(-1)
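
The alignment in step 2 is a right shift of the mantissa that has the smaller exponent, here 0.35's mantissa 1.0110 * 2^(-2). A small sketch, assuming the mantissa is held as a 5-bit integer (1 integer bit plus 4 fraction bits) and that shifted-out bits are dropped; `align` is a hypothetical helper name.

```python
def align(mantissa, exp, target_exp):
    """Shift an integer mantissa right so that its exponent becomes target_exp.

    Assumes target_exp >= exp; bits shifted out on the right are dropped.
    """
    shift = target_exp - exp
    return mantissa >> shift, target_exp

# 1.0110 * 2^(-2), stored as the integer 0b10110, aligned to exponent -1
m, e = align(0b10110, -2, -1)
print(format(m, "05b"), e)  # 01011 -1, i.e. 0.1011 * 2^(-1)
```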

3) Subtract the mantissas:

    0.1011
    1.0011-
 ----------
{1} 1.1000

{1} means there is a carry (borrow) bit. The new mantissa is 1.1000. So how do we determine the sign bit? We can also do the operation using 2's complement, like this:

       0.1011
       0.1101 +
  ------------
  {0}  1.1000

The carry bit is zero. So is that a positive number or a negative one? How is the sign determined?

4) Assuming that we already know the sign is negative, the result is 1.1000 * 2^(-1).

Any ideas on how to determine the sign bit?
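
One common way to read the 2's complement version, offered as a sketch rather than a definitive description of any particular hardware: the mantissas are unsigned magnitudes, so when A + (2's complement of B) is computed in a fixed width, a carry-out of 1 means A >= B and the result bits are already the magnitude, while a carry-out of 0 means A < B, so the result is negative and the bits must be complemented again to recover the magnitude. The function name `subtract_mantissas` and the 5-bit width are assumptions made for this example.

```python
def subtract_mantissas(a, b, width=5):
    """Compute a - b on unsigned width-bit mantissas via two's-complement addition.

    Returns (sign, magnitude, carry), where sign is 0 for non-negative
    and 1 for negative.
    """
    mask = (1 << width) - 1
    total = a + (~b & mask) + 1      # a + (one's complement of b) + 1
    carry = total >> width           # carry out of the top bit
    result = total & mask
    if carry == 1:
        # a >= b: the result bits are the magnitude as they stand
        return 0, result, carry
    # carry == 0 means a < b: the bits encode a negative value, so take
    # the two's complement again to get the magnitude
    return 1, (~result + 1) & mask, carry

# 0.1011 - 1.0011, both with exponent -1
sign, mag, carry = subtract_mantissas(0b01011, 0b10011)
print(sign, format(mag, "05b"), carry)  # 1 01000 0
```

Under this reading the sign is negative and the magnitude is 0.1000 * 2^(-1) rather than 1.1000 * 2^(-1). Normalized, that is -1.0000 * 2^(-2) = -0.25, which agrees with the truncated operands: 0.34375 - 0.59375 = -0.25.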