Floating point rounding

277 Views Asked by At

Having trouble proving these two statements are true if we assume no overflow occurs and all rounding modes(round down, round up, round to zero, round to nearest) are valid.

1) If x is non zero finite floating point number, then round(x+x) = round(2x)

2) If x and y are two finite floating point numbers, then round(x-y) = -round(y-x)

1

There are 1 best solutions below

0
On

I would try in this way, under your hypothesis.

1) is $x$ (normal or subnormal) then $2x$ is a fp number which is normal (assuming no overflow) in this case the rounding will have no effect, since basically the mantissa is not shifted in such operation and the exponent will be increased by one, or at most it will be left shifted if the number is subnormal and the exponent will be unchanged. About $round(x + x)$ just apply the rule of the addition since

$$x + x = (-1)^{s_x}2^{e_x} (\mu_x + \mu_x),$$

where $\mu_x$ is the significand (it could have a leading 1 or not depending if $x$ is normal or subnormal). The sum is a fixed point sum, so it doesn't entail no error, i.e. is exact so you can write

$$(-1)^{s_x}2^{ex} (\mu_x + \mu_x) = (-1)^{s_x} 2^{e_x} (2\mu_x)$$

if we assume that $\mu_x = 0n.\mu_{-1}\mu_{-2}...\mu_{-t}000$ where $t$, $n$ is 1 if $x$ is a normal number otherwise it is 0, and the last three zeroes are related to guard, round and sticky bits, then $2\mu = n\mu_{-1}.\mu_{-2}...\mu_{-t}0000$ i think at this point we can distinguish two cases, if $n = 1$ the sum result will be right shifted and the exponent will be incremented, but since the the guard digits will be 0 then the rounding doesn't affect anything, otherwise the exponent will be unchanged while the resulting mantissa will corresponds to a left shift. So basically the result it's the same.

2) $round(x - y) = - round(y - x)$ i think is more trivial since basically either $x - y$ or $y - x$ the actual sum to be performed is the same. Indeed as first step the algorithm for a floating point sum will compare both $|x|$ and $|y|$ and swaps them in the case $|x| < |y|$. So assuming $|x| \geq |y|$ and that the result $round(x - y) = z$, if we perform $round(y - x)$ the algorithm will swap $y$ with $x$ which will outputs $-z$ so

$$round(y - x) = - z = - round(x - y) \Rightarrow - round(y - x) = round(x - y)$$.