Method for finding the largest positive difference between two pairs of IEEE 754 double precision floating point numbers and fixed-point numbers

52 Views Asked by At

I have two pairs of IEEE 754 double precision (64-bit) floating-point numbers and unsigned fixed-point numbers, and I'm trying to find which pair has the greatest difference.

The fixed-point numbers have two decimal places, so 123 represents 1.23. The fixed-point number has a range of $[0,2^{32}]$. It is stored in a 64-bit unsigned format that spans $[0,100\times2^{32}]$ to accommodate the full fixed-point range. Both floating-point numbers are in the range $[0,2^{32}]$.

More formally, for pairs (tuples) of floating-point and unscaled fixed-point numbers $(f_a, x_a)$ and $(f_b,x_b)$, I would like to evaluate whether the following inequality is true:

$$\left(f_a-\frac {x_a} {100}\right) > \left(f_b-\frac {x_b} {100}\right)$$

For example, I might compare:

$$(f_a, x_a) = (0.0100000000000000002081668171172, 1) \\ (f_b, x_b) = (0.0299999999999999988897769753748, 3)$$

$$0.0100000000000000002081668171172 - \frac 1 {100} = 2.081668171172 \times 10^{-19} \\ 0.0299999999999999988897769753748 - \frac 3 {100} = -1.1102230246252 \times 10^{-18} \\ 2.081668171172 \times 10^{-19} > -1.1102230246252 \times 10^{-18}$$

As such, in this case $(f_a, x_a)$ is considered the larger pair.

Another example:

$$(f_a, x_a) = (6.32000000000000028421709430404, 632) \\ (f_b, x_b) = (14.7200000000000006394884621841, 1472)$$

$$6.32000000000000028421709430404 - \frac {632} {100} = 2.8421709430404 \times 10^{-16} \\ 14.7200000000000006394884621841 - \frac {1472} {100} = 6.394884621841 \times 10^{-16} \\ 2.8421709430404 \times 10^{-16} < 6.394884621841 \times 10^{-16}$$

In this case $(f_b, x_b)$ is considered the larger pair.

While I've shown cases where the fixed-point value closely matches the floating-point value, for explanatory purposes, this will not always be the case.

There may be other cases where the magnitude of the difference between the two sides of the inequality is also extremely small and not precisely representable as an IEEE 754 double precision number.

Critically, this test must work for floating-point values that are very close to their associated fixed-point values, and must not result in any loss of precision. Since there are fixed-point numbers that cannot be precisely represented as floating-point, and vice versa, I cannot convert one way or the other to perform a simple subtraction and comparison in a common format.

While this comparison check could be implemented as a string conversion, that would be an extremely inefficient approach.

IEEE 754 double precision floating point numbers are essentially in the following format:

$$(-1)^{s} \times \left(1 + \sum_{i=1}^{52} {m_{52-i} \times 2^{-i}}\right)\times2^{e-1023}$$

Where $s$ is the sign bit, $m$ is the set of binary bits in the mantissa (or "fraction"), and $e$ is the exponent.

packed bit format for IEEE 754 doubles; credit Codekaizen, CC BY-SA 4.0

(from wikipedia)

Therefore, each side of the comparison can be expressed as:

$$\left[(-1)^{s} \times \left(1 + \sum_{i=1}^{52} {m_{52-i} \times 2^{-i}}\right)\times2^{e-1023}\right] - \frac x {100}$$

Since the floating-point numbers are always positive, $s$ is always zero, so this can be simplified to:

$$\left[\left(1 + \sum_{i=1}^{52} {m_{52-i} \times 2^{-i}}\right)\times2^{e-1023}\right] - \frac x {100}$$

Given this, we might instead choose to represent each floating point number as a pair of integers representing the values of $e$ and $m$.

My question is this: given two sets of integers $(e_a,m_a,x_a)$ and $(e_b,m_b,x_b)$, is there a precise and efficiently computable (assuming a modern x86 processor with AVX-512) way to determine which of the two is larger given the inequality at the start of this question?