How do you add 8-bit floating-point numbers with different signs?


Hi, I'm having trouble adding two 8-bit floating-point numbers with different signs. Here is the question:

1 100 1100 + 0 101 1011 =

The first bit is the sign, the next 3 bits are the exponent, and the last 4 bits are the mantissa.

Thank you, have a good day.


There are 2 best solutions below

0
On
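  • align the exponents: shift the mantissa of the operand with the smaller exponent to the right so that both operands share the larger exponent
  • add the two mantissas as signed integers (so operands of opposite sign are effectively subtracted)
  • if necessary, shift the resulting mantissa and adjust the exponent accordingly so that the result is normalized again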
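The three steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a reference implementation: it assumes a hidden leading 1 in front of the 4 mantissa bits, no special encodings for zero, infinity or NaN, and truncation of any bits that do not fit. The names `unpack`, `fp8_add` and `GUARD` are made up for this example.

```python
GUARD = 7  # extra low-order bits so the alignment shift loses nothing (max exponent gap is 7)

def unpack(x):
    """Split sign(1) | exponent(3) | mantissa(4) into (sign, exponent field, significand)."""
    sign = -1 if x & 0x80 else 1
    exp = (x >> 4) & 0b111
    sig = 0b10000 | (x & 0b1111)  # assumed hidden leading 1 plus 4 fraction bits
    return sign, exp, sig

def fp8_add(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)

    # Step 1: bring both significands to the larger exponent.
    exp = max(ea, eb)
    ma = (ma << GUARD) >> (exp - ea)
    mb = (mb << GUARD) >> (exp - eb)

    # Step 2: signed integer addition of the aligned significands.
    total = sa * ma + sb * mb
    if total == 0:
        raise ValueError("exact zero has no encoding in this simplified format")
    sign_bit = 1 if total < 0 else 0
    mag = abs(total)

    # Step 3: renormalize so the leading 1 sits just above the 4 fraction bits.
    n = mag.bit_length()
    exp += n - (5 + GUARD)
    if not 0 <= exp <= 7:
        raise OverflowError("exponent does not fit in 3 bits")
    top5 = mag >> (n - 5) if n >= 5 else mag << (5 - n)  # truncates extra bits
    return (sign_bit << 7) | (exp << 4) | (top5 & 0b1111)

print(format(fp8_add(0b11001100, 0b01011011), "08b"))  # 01001010
```

Note that the exponent offset never appears in the code: only the difference between the two exponent fields matters for the alignment.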
2
On

First, in this kind of binary floating-point representation, the exponent is typically stored with a pre-agreed offset (bias). For simplicity, let's say the offset is $(100)_2$, and assume the usual implicit leading 1 in front of the mantissa bits. We can then expand the two bit patterns into two numbers:

$$
\require{enclose}
\def\xD{{}_{10}} \def\xH{{}_{16}} \def\xB{{}_{2}}
\newcommand{\xP}[2][black]{\enclose{box}{\color{#1}{\small\verb/#2/}}}
\begin{array}{rcl}
\xP[red]{1}\xP[orange]{100}\xP[green]{1100} &+& \xP[black]{0}\xP[blue]{101}\xP[magenta]{1011}\\
&\Downarrow\\
\color{red}{-}(1.\color{green}{1100})\xB \times 2^{(\color{orange}{100})\xB - (100)\xB} &+& (1.\color{magenta}{1011})\xB \times 2^{(\color{blue}{101})\xB - (100)\xB} \\
&\Downarrow\\
\underbrace{-\left(1+\frac12+\frac14\right) \times 2^0 }_{= -\frac74} &+& \underbrace{\left(1 + \frac12 + \frac18 + \frac{1}{16}\right)\times 2^1}_{= \frac{27}{8}}
\end{array}
$$

The bit pattern on the left corresponds to the number $-\frac{7}{4}$, while the one on the right corresponds to $\frac{27}{8}$. Their sum is $\frac{27}{8} - \frac{14}{8} = \frac{13}{8}$. Since

$$\frac{13}{8} = \left(1 + \frac12 + \frac18\right) \times 2^0 = (1.\color{gold}{1010})\xB \times 2^{(\color{orange}{100})\xB - (100)\xB}$$

the bit pattern for the result is $\xP{0}\xP[orange]{100}\xP[gold]{1010}$.

You can start from a different offset. However, the two numbers at hand and their sum all have comparable magnitudes, so there are no issues of underflow or overflow. Whatever offset you choose, you will get the same final bit pattern.
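As a quick sanity check of the offset-independence claim, here is a small Python sketch that decodes both operands to exact fractions, adds them, and re-encodes the sum for several offsets. It assumes the same layout as above (hidden leading 1, no special values); `decode` and `encode` are hypothetical helper names, and `encode` simply refuses values that are not exactly representable instead of rounding.

```python
from fractions import Fraction

def decode(bits, bias):
    """Exact value of sign(1) | exponent(3) | mantissa(4) for a given exponent offset."""
    sign = -1 if bits & 0x80 else 1
    exp = (bits >> 4) & 0b111
    frac = bits & 0b1111
    return sign * (1 + Fraction(frac, 16)) * Fraction(2) ** (exp - bias)

def encode(value, bias):
    """Inverse of decode for exactly representable, nonzero values."""
    if value == 0:
        raise ValueError("zero has no encoding in this simplified format")
    sign_bit = 1 if value < 0 else 0
    mag = abs(Fraction(value))
    e = 0
    while mag >= 2:      # scale down until 1 <= mag < 2
        mag /= 2
        e += 1
    while mag < 1:       # scale up until 1 <= mag < 2
        mag *= 2
        e -= 1
    frac = (mag - 1) * 16
    if frac.denominator != 1:
        raise ValueError("needs more than 4 mantissa bits")
    stored = e + bias
    if not 0 <= stored <= 7:
        raise OverflowError("exponent does not fit in 3 bits")
    return (sign_bit << 7) | (stored << 4) | int(frac)

a, b = 0b11001100, 0b01011011
for bias in (3, 4, 5):
    total = decode(a, bias) + decode(b, bias)
    print(bias, total, format(encode(total, bias), "08b"))
```

With offset $(100)_2 = 4$ the decoded operands are $-\frac74$ and $\frac{27}{8}$ as in the derivation; other offsets scale all three values by the same power of two and leave the encoded result unchanged.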