Range of number systems


I don't understand how to get the fraction part. This is what I came up with for the integer part.

A) For 12 bit unsigned = 0 to 4095

B) For 12 bit signed = -2048 to 2047

C) For 12 bit in 2's complement -4096 to 4095

Hopefully someone can explain the fraction part, thanks!

Best answer

Fixed point means that fractional values are multiplied by some fixed scaling factor, and the result is stored as an integer (after truncation or rounding). In other words, to get the true value, you interpret the stored value as an integer and multiply it by the inverse of the scaling factor. Usually, the scaling factor is a power of whatever base you pick for representing integers. The exponent then tells you how many fractional digits are preserved by this transform.
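As a minimal sketch of that scale-and-store idea, assuming 12 fractional bits and truncation toward zero (the names `FRAC_BITS`, `to_fixed` and `from_fixed` are made up for illustration), the round trip could look like this in Python:

```python
from fractions import Fraction

FRAC_BITS = 12            # assumed number of fractional bits
SCALE = 1 << FRAC_BITS    # scaling factor 2**12 = 4096

def to_fixed(value):
    """Multiply by the scaling factor and truncate to the stored integer."""
    return int(Fraction(value) * SCALE)

def from_fixed(stored):
    """Interpret the stored integer and multiply by the inverse scaling factor."""
    return Fraction(stored, SCALE)

stored = to_fixed(Fraction(5, 8))   # 0.625 has an exact 12-bit fraction
recovered = from_fixed(stored)      # exactly 5/8 again
```

Values that do not fit into 12 fractional bits, such as 0.1, lose their low-order part in `to_fixed`, which is the truncation mentioned above.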

In the case of (a), that means that fractional values are multiplied by $2^{12}$, and the result is stored as a 24-bit integer. To get the actual value back, you thus need to multiply the stored integer by $2^{-12}$. The largest possible value this can yield is $(2^{24}-1)2^{-12} = 2^{12} - 2^{-12}$. The smallest value is, unsurprisingly, $0\cdot 2^{-12} = 0$.

In the case of (b), the (signed!) integer is stored as one sign bit plus 23 magnitude bits. The scaling factor is again $2^{12}$, since the format stores 12 fractional digits. The maximum value the integer can take is $2^{23}-1$, which corresponds to an actual value of $(2^{23}-1)2^{-12} = 2^{11} - 2^{-12}$. The minimum value the integer can take is $-(2^{23}-1)$, so the minimal actual value is $-(2^{11} - 2^{-12}) = -2^{11} + 2^{-12}$.
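The two ranges above can be double-checked with exact rational arithmetic. This is only a sketch under the assumptions used in this answer (24 bits in total, 12 of them fractional); the function names are hypothetical:

```python
from fractions import Fraction

def unsigned_range(total_bits, frac_bits):
    """(min, max) value of an unsigned fixed-point format."""
    scale = Fraction(1, 2**frac_bits)          # inverse scaling factor
    return 0 * scale, (2**total_bits - 1) * scale

def sign_magnitude_range(total_bits, frac_bits):
    """(min, max) value with one sign bit and total_bits - 1 magnitude bits."""
    scale = Fraction(1, 2**frac_bits)
    max_int = 2**(total_bits - 1) - 1          # largest magnitude integer
    return -max_int * scale, max_int * scale

range_a = unsigned_range(24, 12)        # case (a)
range_b = sign_magnitude_range(24, 12)  # case (b)
```

Using `Fraction` rather than floats keeps the $2^{-12}$ terms exact, so the results can be compared directly with the closed forms.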

I'll leave (c) to you.


You can also view this in another way. Fractional values work in the binary system just as they do in the decimal system - the first digit to the right of the "decimal" point has weight $\frac{1}{2}$, the next one $\frac{1}{4}$ and so on.

Thus, if as in the case of (a) you have 12 digits to the left of the "decimal" point and 12 to the right, and if all these digits are $1$, the value is $$ \underbrace{2^{11} + 2^{10} + \ldots + 2^0}_{=2^{12}-1} + \underbrace{2^{-1} + \ldots + 2^{-12}}_{=1 - 2^{-12}} = 2^{12} - 2^{-12} $$ Since all weights are positive, that must be the largest representable number - if any bit were zero, you'd add fewer values, so the result would surely be smaller.

In the case of (b), you only have 11 digits to the left of the decimal point, and now the weights are either all positive (if the sign bit is $0$), or all negative (if the sign bit is $1$). You thus get the smallest number by setting all bits to $1$, and the resulting value is $$ \underbrace{-2^{10} - 2^{9} - \ldots - 2^0}_{=-(2^{11}-1)} + \underbrace{-2^{-1} - \ldots - 2^{-12}}_{=-(1 - 2^{-12})} = -2^{11} + 2^{-12} $$
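The positional view can also be checked by summing the bit weights directly. The sketch below assumes bit strings written most significant bit first; the helper names `unsigned_value` and `sign_magnitude_value` are illustrative only:

```python
from fractions import Fraction

def unsigned_value(bits, int_bits):
    """Sum the weights of all set bits; the first int_bits bits are left of the point."""
    total = Fraction(0)
    for i, bit in enumerate(bits):
        if bit == "1":
            # weights run 2**(int_bits-1), ..., 2**0, 2**-1, 2**-2, ...
            total += Fraction(2) ** (int_bits - 1 - i)
    return total

def sign_magnitude_value(bits, int_bits):
    """First bit is the sign; the remaining bits are an unsigned magnitude."""
    magnitude = unsigned_value(bits[1:], int_bits)
    return -magnitude if bits[0] == "1" else magnitude

largest_a = unsigned_value("1" * 24, 12)          # all ones, case (a)
smallest_b = sign_magnitude_value("1" * 24, 11)   # all ones, case (b)
```

Setting every bit to `1` reproduces the two extremes derived above: the largest value in case (a) and the smallest value in case (b).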