Alternative expression for Rounding Function on Floating Point Arithmetic

94 Views Asked by At

Let $\mathbb{F}(\beta,t,e_{\min},e_{\max})$ be a Floating Point Arithmetic. Let $\text{domain}(\mathbb{F}) = [x_{\min},x_{\max}] \subseteq \mathbb{R}$ for minimal and maximal elements $x_{\min},x_{\max} \in \mathbb{F}$. Let $\text{rd}(x): \text{domain}(\mathbb{F}) \longrightarrow \mathbb{F}$ be a rounding operation in accordance with the IEEE standards for rounding.

Under the above assumptions it is known, that for all $x \in \text{domain}(\mathbb{F})$ a $\delta \in [-u(\mathbb{F}),u(\mathbb{F})]$ exists such that $$rd(x) = x(1+\delta),$$ where $u(\mathbb{F}) := \frac{1}{2}\beta^{1-t}$ is the unit roundoff of $\mathbb{F}$. Now, throughout the literature it is often mentioned that the above expression can equivalently formulated as $$rd(x) = \frac{x}{1+\delta^{\ast}}$$ for some $\delta^{\ast} \in [-u(\mathbb{F}),u(\mathbb{F})]$. While conceptually the alternative expression does make sense to me, I am unable to formalise the equivalence of the expressions. How might one progress in order to so do?

1

There are 1 best solutions below

0
On

The two representations are not equivalent, but they can be derived using the same basic principles. Let $$x = (1.f_1f_2\dots)_2 \cdot 2^m$$ denote a positive real number in the representational range. Let $$x_- = (1.f_1f_2\dots f_k)_2 \cdot 2^m$$ denote the largest machine number which is less than or equal to $x$ and let $$x_+ = x_1 + 2^{m-k}$$ denote the next floating point number. The floating point representation $\text{fl}(x)$ of $x$ is $x_-$ when $x_-$ is closer to $x$ than $x_+$ and vise versa. Any ties are resolved as dictated by the rounding mode. In any case, we have $$|x- \text{fl}(x)| \leq \frac{1}{2} 2^{m-k} = 2^m u,$$ where $u = 2^{-k-1}$ is the unit roundoff.

This is the point where the analysis branches.

  1. Since $2^m \leq x $ we have the relative error bound $$\frac{|x- \text{fl}(x)|}{|x|} \leq u.$$ It follows that $fl(x) = x(1 + \delta)$ where $\delta = \frac{\text{fl}(x) - x}{x}$ satisfies $|\delta| \leq u$.
  2. Since $2^m \leq \text{fl}(x)$ we also have $$\frac{|x- \text{fl}(x)|}{|\text{fl}(x)|} \leq u.$$ It follows that $x = \text{fl}(x)(1 + \epsilon)$ where $\epsilon = \frac{x - \text{fl}(x)}{\text{fl}(x)}$ satisfies $|\epsilon| \leq u$. We now conclude that $$\text{fl}(x) = \frac{x}{1 + \epsilon}.$$

Both representations are useful when analyzing floating point calculations.