How to deal with underflow issues in high-dimensional entropy calculation?


I was not sure if the question makes sense here or should better be placed in a computational/CS forum, but I hope you can give me some insights.

I am working in image processing and use the following formula to compute the entropy of a $d\times N$ matrix that contains $N$ vectorized regions sampled across an image as in [1]. I am using the assumption that the sampled points follow a Gaussian distribution with covariance matrix $\Sigma_d\in \mathbb{R}^{dxd}$ to be able to compute the entropy directly as in [2].

$$H_g(\Sigma_d ) = \log\left((2\pi e)^{\frac{d}{2}}|\Sigma_d|^\frac{1}{2}\right)$$

Since the covariance matrix is symmetric positive definite, its determinant should be strictly positive.

In practice the problem is that the covariance matrix $\Sigma_d$ is near-singular in many cases, so $|\Sigma_d|^\frac{1}{2}$ underflows to $0$ in floating-point arithmetic and the log evaluates to $-\infty$. This is problematic because I use the entropies to compute mutual information inside an optimization loop.

My idea now was to rewrite the formula as

$$ H_g(\Sigma_d ) = \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log|\Sigma_d| $$ and use $|M| = \prod_{i=1}^d \lambda_i$, where the $\lambda_i$ are the eigenvalues of $M \in \mathbb{R}^{d\times d}$, so that $\log|\Sigma_d| = \sum_{i=1}^d \log\lambda_i$, to get

$$ H_g(\Sigma_d ) = \frac{d}{2}\log(2\pi e) + \frac{1}{2} \sum_{i=1}^d \log \lambda_i $$

This seems to be much more robust against underflow (and the resulting $-\infty$) when taking the log of near-zero eigenvalues.
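To make the comparison concrete, here is a minimal NumPy sketch of the two formulations (the function names are mine, not from the references): the direct determinant underflows to zero for a perfectly well-conditioned matrix, while the sum of log-eigenvalues stays finite.

```python
import numpy as np

def entropy_naive(cov):
    # Direct evaluation: underflows to -inf whenever det(cov) rounds to 0.
    d = cov.shape[0]
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.log(np.linalg.det(cov))

def entropy_logeig(cov):
    # Sum of log-eigenvalues: each factor stays inside the representable range.
    d = cov.shape[0]
    eigvals = np.linalg.eigvalsh(cov)  # symmetric matrix -> real eigenvalues
    return 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * np.sum(np.log(eigvals))

# A well-conditioned matrix whose determinant nevertheless underflows:
# det = (1e-2)^400 = 1e-800, far below the double-precision range.
cov = np.eye(400) * 1e-2
print(entropy_naive(cov))    # -inf
print(entropy_logeig(cov))   # finite
```

NumPy also packages this idea as `np.linalg.slogdet`, which returns the sign and the log of the absolute determinant and avoids the same underflow.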

My question now is:

Why is this the case from a mathematical/numerical standpoint? That is, why does this formulation produce $-\infty$ less often? Is there a computational reason why it is numerically more robust?

[1] Russakoff, Daniel B.; Tomasi, Carlo; Rohlfing, Torsten; Maurer, Calvin R. jun., Image similarity using mutual information of regions, Pajdla, Tomáš (ed.) et al., Computer vision – ECCV 2004. 8th European conference on computer vision, Prague, Czech Republic, May 11–14, 2004. Proceedings, Part III. Berlin: Springer (ISBN 3-540-21982-X/pbk). Lecture Notes in Computer Science 3023, 596-607 (2004). ZBL1098.68852.

[2] Cover, Thomas M.; Thomas, Joy A., Elements of information theory, Wiley Series in Telecommunications. New York: John Wiley & Sons, Inc. xxii, 542 p. (1991). ZBL0762.94001.

Best answer:

In numerical work we distinguish between the exact target values and the computed approximations. Frequently it is impossible to obtain the target value exactly, because rounding errors accumulate and we cannot always compensate for them fully. Your workaround functions perfectly as long as the computed eigenvalues are positive. Every IEEE-compliant computer will balk at computing the product $ab$ where $a = b = 2^{-600}$, because the exact result $c = 2^{-1200}$ is far outside the representable range of IEEE double-precision numbers. The computed result is $\hat{c} = 0$, and computing $\log_2(\hat{c})$ will result in $-\infty$. In contrast, your workaround (applied to $\log_2$) will return the correct result of $-1200$.
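The $2^{-600}$ example can be reproduced in a few lines of Python, whose floats are IEEE doubles:

```python
import math

a = 2.0 ** -600
c = a * a                  # exact value is 2^-1200, below the double range
print(c == 0.0)            # True: the product underflowed to zero
# math.log2(c) would now raise a ValueError (log of zero)

# Summing logarithms instead keeps every intermediate representable:
print(math.log2(a) + math.log2(a))   # -1200.0
```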

You will still experience problems when one of your eigenvalues is computed as a tiny negative number. This is not uncommon even when the exact matrix is symmetric positive definite. This is due to the fact that the smallest eigenvalues are very sensitive to perturbations of the original matrix. Mathematically, we say that they are ill-conditioned. The rounding errors committed when extracting the eigenvalues will push them a little bit and this can be enough to ruin the computed sign when the exact values are sufficiently small. The problem is fundamental and there is very little that you can do about it. When you attempt to compute the logarithm of a strictly negative number an IEEE compliant machine will return NaN (Not A Number) and raise a flag. NaNs will ruin all future calculations as any operation involving at least one NaN will produce another NaN and rightfully so.
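If the goal is merely to avoid NaNs derailing the optimization, not to recover accuracy in the tiny eigenvalues, a common pragmatic guard (my suggestion, not part of the answer above) is to floor the computed eigenvalues at a tolerance relative to the largest one before taking logs:

```python
import numpy as np

def safe_log_eigvals(cov, rel_tol=1e-12):
    # Eigenvalues of a symmetric (nominally PSD) matrix, floored at a
    # tolerance relative to the largest eigenvalue, so that rounding noise
    # cannot produce zeros or tiny negatives (which would give -inf / NaN).
    eigvals = np.linalg.eigvalsh(cov)
    floor = rel_tol * eigvals.max()
    return np.log(np.maximum(eigvals, floor))

# A nearly rank-one covariance: two exact eigenvalues are tiny, and the
# computed values can come out slightly negative.
v = np.array([1.0, 2.0, 3.0])
cov = np.outer(v, v) + 1e-8 * np.eye(3)
print(safe_log_eigvals(cov))   # finite values, no NaN
```

The floor is arbitrary; it trades a biased entropy estimate for a finite one, which may or may not be acceptable in a mutual-information objective.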

An alternative which is less likely to fail dramatically is to compute the determinant differently. Instead of extracting the eigenvalues, you compute a QR factorization of the matrix $A$. It takes the form $AP = QR$, where $P$ is a permutation matrix, $Q$ is orthogonal, and $R$ is upper triangular with diagonal entries that will always be computed as nonnegative real numbers. Moreover, the diagonal entries decay monotonically starting from the upper left corner. Technically, this is called a $QR$ factorization with column pivoting; the LAPACK implementation is `dgeqp3`. You have $|\det(A)| = |\det(R)|$, and you can deploy your scheme to compute the determinant of $R$. You will still need to check for the very first zero on the diagonal of $R$. You cannot take advantage of the fact that $A$ is symmetric.
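A sketch of this approach, assuming SciPy is available (`scipy.linalg.qr` with `pivoting=True` wraps the LAPACK routine; the function name `logabsdet_qr` is mine):

```python
import numpy as np
from scipy.linalg import qr

def logabsdet_qr(A):
    # log|det(A)| via a QR factorization with column pivoting.
    # Permuting columns only flips the sign of the determinant, so
    # |det(A)| = |det(R)| = prod |R_ii|, and we sum logs instead of
    # forming the (possibly underflowing) product.
    _, R, _ = qr(A, pivoting=True)
    diag = np.abs(np.diag(R))
    if np.any(diag == 0.0):
        return -np.inf          # numerically rank-deficient
    return np.sum(np.log(diag))

cov = np.eye(400) * 1e-2
print(logabsdet_qr(cov))        # = 400 * log(1e-2), computed without underflow
```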

However, and I cannot stress this enough, you are not necessarily getting an accurate result. You are increasing the odds of avoiding a crash and getting a result. That may be good enough.