Floating Point Number System


I really have no idea how to do these questions; in fact, I have no idea how to do any question in the paper. I have tried to figure out what's going on in the course, called Computational Mathematics, but the lecturer's notes are honestly useless to someone who doesn't have a strong maths background.

The course also has a high failure rate.

I'm trying to find materials online, but the course isn't focused on just one topic. I even asked the lecturer for a recommended book, but he said there isn't one book that covers the whole module, so I'm really stuck. Here's a link to the exam paper. Link

Here's the first question from last year's paper:

Question 1.

(i) How many non-unique, non-normalised, numbers can be represented in a floating-point system defined by parameters $\beta, s, m, M$? $\tag*{ [5 Marks]}$

(ii) How many unique, normalised, numbers can be represented in a floating-point system defined by parameters above? Hint: it is proportional in some way to $\beta^{s-1}$ because no number other than zero itself can start with zero. $\tag*{[8 Marks]}$

(iii) Enumerate all the non-negative, non-unique, non-normalised, numbers in the floating-point system defined by parameters $\beta=4, s=2, m=-1, M=1$ $\tag*{[8 Marks] }$

(iv) Convert the numbers enumerated above into a floating-point system with $\beta=10, s=3, m=-1, M=1 .$ Comment on their distribution and some consequences for computation. $ \tag*{[4 Marks ]}$

Please note that I'm not asking for just the solutions but an explanation and probably a link, so that I can have a background knowledge and so that I'll be able to answer similar questions myself. This is not an assignment, I'm just preparing for an exam.

Thank you. :)

Edit:

2 $\quad$ Finite-precision floating point system - FPS

Let $F(\beta, s, m, M)$ be a system where

  • $\beta$ is the base, e.g. $2,4,10,$ or $16$

  • $s$ is the number of significant digits of the mantissa in base $\beta$.

  • $e \in \mathbb{Z}$ is an exponent, $m \leq e \leq M$

Each number $x \in F$ has the structure $$ \pm \, \underbrace{d_{1} d_{2} \ldots d_{s}}_{\text{mantissa}} \times {\underbrace{\beta}_{\text{basis}}}^{\overbrace{\pm e}^{\text{exponent}}} $$ If $x \neq 0$ then $x$ is normalised if $1 \leq d_{1} \leq \beta-1$ and $0 \leq d_{i} \leq \beta-1, i=2 \ldots s .$ If $x=0$ then $d_{1}=d_{2}=\ldots=d_{s}=0$

There are 1 best solutions below

What I can offer is an analogy using base-10 floating-point numbers.

If a number is not normalized, it has many equivalent representations (infinitely many when the number of digits and the exponent range are unbounded). For example:

$6.25 = 0.625 * 10^1$ (1)

$6.25 = 0.0625 * 10^2$ (2)

$6.25 = 625 * 10^{-2}$ (3)

This is not a 'healthy' environment, because one number has many representations that look different but are in fact equivalent. To make the representation of a number unique, normalization is necessary.
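To see how much redundancy this causes in a finite system, here is a minimal Python sketch that lists every non-negative representation for the parameters in part (iii) of the question ($\beta=4, s=2, m=-1, M=1$). It assumes the mantissa is read as $0.d_1 d_2 \ldots d_s$, which is the convention used in this answer and may differ from the one in the lecture notes; the function name `representations` is only illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def representations(beta, s, m, M):
    """Map each non-negative value to the (digits, exponent) pairs encoding it.

    Assumption: the mantissa d1 d2 ... ds is read as 0.d1d2...ds in base beta.
    """
    table = defaultdict(list)
    for e in range(m, M + 1):
        for digits in product(range(beta), repeat=s):
            mantissa = sum(Fraction(d, beta ** (i + 1)) for i, d in enumerate(digits))
            table[mantissa * Fraction(beta) ** e].append((digits, e))
    return table


table = representations(4, 2, -1, 1)
total = sum(len(reps) for reps in table.values())  # number of digit/exponent patterns
distinct = len(table)                              # number of different values

print(total, distinct)
print(table[Fraction(1, 4)])  # every pattern that encodes the value 1/4
```

Under this convention the count of patterns is simply $\beta^s (M - m + 1)$, while the number of distinct values is smaller because several patterns collapse onto the same number.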

Normalization requires:

I. Every number is written as $0.d_1d_2d_3 \ldots d_s$, where the $d_i$ are its digits.

II. The leading digit (i.e. $d_1$) must not be zero; the other digits have no such restriction. This is formally stated as $1 \le d_1 \le 10 - 1$ and $0 \le d_i \le 10 - 1$ for $i = 2, \ldots, s.$ Of the three representations above, only (1) meets this requirement.

III. To keep the representation numerically equal to the original number, the mantissa must be multiplied by a suitable power of the base, that is, by $10^e$ for some integer $e$, which can be zero, positive or negative.

IV. If the number is 0, then ……

Thus, the normalized representation of $6.25$ is $0.6250000000...00 * 10^1$; the number of 0s appended depends on the size of the ‘container’ or ‘WORD’.
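The rules above can be sketched in a few lines of Python. This is only an illustration: `normalize` is a hypothetical helper, exact fractions are used so that no rounding interferes, and returning exponent 0 for zero is my own assumption, since the rule for zero is not spelled out here.

```python
from fractions import Fraction


def normalize(x, base=10):
    """Return (mantissa, e) with x == mantissa * base**e and 1/base <= |mantissa| < 1."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0), 0  # assumption: zero is stored with exponent 0
    sign = -1 if x < 0 else 1
    f, e = abs(x), 0
    while f >= 1:                 # too large: shift the point left
        f /= base
        e += 1
    while f < Fraction(1, base):  # leading digit would be 0: shift right
        f *= base
        e -= 1
    return sign * f, e


print(normalize(Fraction("6.25")))    # mantissa 5/8 = 0.625, exponent 1
print(normalize(Fraction("703.94")))  # mantissa 0.70394, exponent 3
```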

If the size of the WORD (the number of digits $s$) and the exponent bounds $m$ and $M$ (as in $m \le e \le M$) are given, one can find the smallest and largest numbers that the system can hold.
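As a rough check of that claim, the sketch below enumerates the positive normalized numbers of a small system and reads off the extremes. It again assumes the $0.d_1 d_2 \ldots d_s$ reading of the mantissa, and `positive_normalized` is an illustrative name, not something from the course notes.

```python
from fractions import Fraction
from itertools import product


def positive_normalized(beta, s, m, M):
    """Sorted list of all positive normalized values, mantissa read as 0.d1...ds."""
    values = set()
    for e in range(m, M + 1):
        for digits in product(range(beta), repeat=s):
            if digits[0] == 0:  # normalized numbers must not start with 0
                continue
            mantissa = sum(Fraction(d, beta ** (i + 1)) for i, d in enumerate(digits))
            values.add(mantissa * Fraction(beta) ** e)
    return sorted(values)


nums = positive_normalized(4, 2, -1, 1)
print(len(nums), nums[0], nums[-1])
```

With this convention the smallest positive normalized number is $0.10\ldots0 \times \beta^{m} = \beta^{m-1}$ and the largest is $(1 - \beta^{-s})\,\beta^{M}$; the count of positive normalized numbers is $(\beta - 1)\,\beta^{s-1}\,(M - m + 1)$, which matches the hint in part (ii) of the question.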

Example: addition of two floating-point numbers using a simplified representation

$6.25 + 703.94 = 0.625 * 10^1 + 0.70394 *10^3$

$= 0.00625 *10^3 + 0.70394 *10^3$

$= 0.71019 *10^3$

$= 710.19$

Note-1: Addition and subtraction can only be done after the two operands have been converted to the same exponent (the 'same level').

Note-2: Some digits may be lost in that conversion.

Note-3: The result might exceed the upper or lower limit of the system (i.e. an overflow or underflow).
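Here is a small Python sketch of the alignment step and of the digit loss mentioned in Note-2. It handles positive base-10 numbers only, stores the mantissa as an $s$-digit integer, and truncates rather than rounds; the helper names `to_fp`, `fp_add` and `fp_value` are made up for this illustration.

```python
from fractions import Fraction


def to_fp(x, s):
    """Truncate a positive number to (mantissa, e), meaning 0.mantissa * 10**e."""
    x, e = Fraction(x), 0
    while x >= 1:
        x /= 10
        e += 1
    while x < Fraction(1, 10):
        x *= 10
        e -= 1
    return int(x * 10 ** s), e  # keep only the first s digits


def fp_add(a, b, s):
    """Add two positive (mantissa, e) pairs with s-digit mantissas."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:  # make a the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb //= 10 ** (ea - eb)  # align exponents; digits shifted out are lost
    m, e = ma + mb, ea
    if m >= 10 ** s:        # carry: renormalize and drop the last digit
        m //= 10
        e += 1
    return m, e


def fp_value(fp, s):
    """Exact value represented by a (mantissa, e) pair."""
    m, e = fp
    return Fraction(m, 10 ** s) * Fraction(10) ** e


for s in (5, 4):
    total = fp_add(to_fp(Fraction("6.25"), s), to_fp(Fraction("703.94"), s), s)
    print(s, float(fp_value(total, s)))
```

With five digits the sum comes out exactly as in the worked example; with four digits both the stored operand 703.94 and the shifted operand 6.25 lose digits, and the result is 710.1.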

Example: multiplication of two floating-point numbers using a simplified representation

$6.25 * 703.94 = (0.625 * 10^1) * (0.70394 *10^3)$

$= (0.625 * 0.70394) *10^{1 + 3}$

$= 0.4399625 *10^4$

$= 4399.625$

Note-4: The comment in Note-3 applies here as well.

Note-5: Truncation may occur, because the product of two $s$-digit mantissas can have up to $2s$ digits.
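Both effects can be tried out with Python's standard `decimal` module, which lets you fix the number of significant digits and the exponent range. This is only a sketch: the five-digit precision and the tiny exponent range are my own choices for illustration, and `decimal` bounds the exponent of the $d.ddd \times 10^e$ form, which is one less than the exponent of the $0.dddd \times 10^e$ form used in this answer.

```python
from decimal import Context, Decimal, Overflow, ROUND_DOWN

a, b = Decimal("6.25"), Decimal("703.94")

# Five significant digits, truncating instead of rounding.
ctx = Context(prec=5, rounding=ROUND_DOWN)
product = ctx.multiply(a, b)  # exact product is 4399.625
print(product)

# Same precision, but exponents limited to -2..2: the product no longer fits.
tiny = Context(prec=5, rounding=ROUND_DOWN, Emin=-2, Emax=2, traps=[Overflow])
try:
    tiny.multiply(a, b)
    overflowed = False
except Overflow:
    overflowed = True
print(overflowed)
```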

Further note: When an expression is evaluated in more than one step, different orders of operations may yield different results. For example, computing the average of $a$ and $b$ as (1) $(a + b)/2$ or as (2) $a + (b - a)/2$ may yield different results because of errors such as truncation.
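A concrete instance, sketched with Python's `decimal` module set to three significant digits with truncation (the values $a = 5.01$ and $b = 5.03$ are picked by me just to make the effect visible):

```python
from decimal import Context, Decimal, ROUND_DOWN

ctx = Context(prec=3, rounding=ROUND_DOWN)  # 3 significant digits, truncation
a, b = Decimal("5.01"), Decimal("5.03")
two = Decimal(2)

# Formula (1): (a + b) / 2. The sum 10.04 is truncated to 10.0 first.
mid1 = ctx.divide(ctx.add(a, b), two)

# Formula (2): a + (b - a) / 2. Every intermediate result fits in 3 digits.
mid2 = ctx.add(a, ctx.divide(ctx.subtract(b, a), two))

print(mid1, mid2)
```

Formula (1) gives 5.0, which is not even between $a$ and $b$, while formula (2) gives the expected 5.02.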