Predicting if an Optimization Algorithm is "Doomed to Fail"?


Suppose we have a weighted sum (with weights $\pi_i$) of Normal distributions, i.e. a mixture distribution (https://en.wikipedia.org/wiki/Mixture_distribution):

\begin{align*} p(x|\theta) &= \sum_{i=1}^{k} \pi_i \mathcal{N}(x|\mu_i, \sigma_i^2) \end{align*}

where the mixture weights satisfy $$\sum_{i=1}^{k} \pi_i = 1$$

Suppose we want to estimate all parameters of this function, i.e. $\theta = \{\pi_i, \mu_i, \sigma_i\}_{i=1}^{k}$. We can do this by writing down the Likelihood Function of this problem, taking the Log Likelihood, setting the derivative of the Log Likelihood with respect to each parameter equal to 0, and solving numerically (this whole process is called Maximum Likelihood Estimation, https://en.wikipedia.org/wiki/Maximum_likelihood_estimation):

\begin{align*} L(\theta|X) &= \prod_{j=1}^{n} p(x_j|\theta) = \prod_{j=1}^{n} \sum_{i=1}^{k} \pi_i \mathcal{N}(x_j|\mu_i, \sigma_i^2) \end{align*}

\begin{align*} \log L(\theta|X) &= \sum_{j=1}^{n} \log p(x_j|\theta) = \sum_{j=1}^{n} \log \left( \sum_{i=1}^{k} \pi_i \mathcal{N}(x_j|\mu_i, \sigma_i^2) \right) \end{align*}

$$\frac{\partial}{\partial \pi_i} \log L(\theta|X) = 0 $$ $$\frac{\partial}{\partial \mu_i} \log L(\theta|X) = 0 $$ $$\frac{\partial}{\partial \sigma_i} \log L(\theta|X) = 0 $$
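To make the objects above concrete, here is a small sketch (in Python with NumPy; the function name `gmm_log_likelihood` is mine, not from any library) that evaluates the log likelihood of a one-dimensional Gaussian mixture:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, sigmas):
    """Log likelihood of 1-D data x under a Gaussian mixture.

    weights, means, sigmas are length-k sequences; weights sum to 1.
    """
    x = np.asarray(x, dtype=float)[:, None]      # shape (n, 1)
    weights = np.asarray(weights, dtype=float)   # shape (k,)
    means = np.asarray(means, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Component densities N(x_j | mu_i, sigma_i^2), shape (n, k).
    dens = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    # log p(x_j | theta) = log sum_i pi_i N(...); the log cannot be pushed
    # inside this inner sum, which is why closed-form MLE fails here.
    return np.sum(np.log(dens @ weights))
```

For a single standard Normal component evaluated at its mean, this returns $-\tfrac{1}{2}\log(2\pi)$, as expected.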

However, there are some well-known problems that occur during this process for mixture distributions specifically. Because the log acts on a sum over components, the convenient identity $\log(ab) = \log(a) + \log(b)$ cannot be used to simplify the inner sum, which makes the derivatives of the log likelihood very difficult to work with in closed form (in practice, another algorithm called the Expectation-Maximization algorithm is used here, which sidesteps this problem by iteratively optimizing the function: https://en.wikipedia.org/wiki/EM_algorithm_and_GMM_model , https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm).
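As a rough illustration of how EM sidesteps the intractable gradient equations, here is a minimal and deliberately naive 1-D EM sketch (all names are mine; real implementations such as scikit-learn's `GaussianMixture` add safeguards against exactly the variance-collapse problem discussed next, which this sketch omits):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch only)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Deterministic initialization: spread the means over data quantiles.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[j, i] = P(component i | x_j).
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form weighted updates for each component; these
        # exist even though the direct derivative equations do not decouple.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma
```

On well-separated data (say two clusters around $-5$ and $5$), this recovers the component means; nothing in it, however, prevents a $\sigma_i$ from shrinking toward 0.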

Beyond this, there are still other problems associated with optimizing this function that I am trying to understand.

Zero Variance Estimation due to Singularity Problems: I was talking to a friend about this, and reasoning informally (no rigorous math), we discussed the case where a given dataset has large outliers. We imagined that if there are $k$ clusters and one large outlier, it is possible that an entire cluster gets assigned to that single outlier. A single point has no spread, so the estimated variance of that cluster is driven to 0.

However, a more popular problem is described here (https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf , page 433 in the actual book or page 453 in the pdf version of the book). Suppose we are looking at the log likelihood for a mixture of two Normal distributions (one-dimensional case):

$$L = \sum_{i=1}^{n} \log \left[ \frac{\alpha}{\sqrt{2 \pi}\,\sigma_1} \exp \left(-\frac{1}{2} \frac{(x_i-\mu_1)^2}{\sigma_1^2}\right) + \frac{1-\alpha}{\sqrt{2 \pi}\,\sigma_2} \exp \left(-\frac{1}{2} \frac{(x_i-\mu_2)^2}{\sigma_2^2}\right) \right]$$

During some iteration of the numerical optimization, suppose the value $\mu_1 = x_1$ occurs. Then the exponent of the first component vanishes in the $i=1$ term, and the likelihood becomes:

$$L = \log \left[ \frac{\alpha}{\sqrt{2 \pi}\,\sigma_1} + \frac{1-\alpha}{\sqrt{2 \pi}\,\sigma_2} \exp \left(-\frac{1}{2} \frac{(x_1-\mu_2)^2}{\sigma_2^2}\right) \right] + \sum_{i=2}^{n} \log \left[ \frac{\alpha}{\sqrt{2 \pi}\,\sigma_1} \exp \left(-\frac{1}{2} \frac{(x_i-\mu_1)^2}{\sigma_1^2}\right) + \frac{1-\alpha}{\sqrt{2 \pi}\,\sigma_2} \exp \left(-\frac{1}{2} \frac{(x_i-\mu_2)^2}{\sigma_2^2}\right) \right]$$

If this were to happen, we can see that as $\sigma_1$ becomes smaller and smaller (i.e. approaches 0), the contribution from the point $x_1$ grows without bound, and so the value of $L$ becomes larger and larger. Since the objective is to maximize the likelihood, we fall into a trap: the numerical optimization algorithm is tricked into driving this variance toward 0 in order to increase the likelihood, eventually bringing the value of the likelihood function to infinity and $\sigma_1$ to 0. When this happens, we are said to have encountered a "singularity" and to have "collapsed" the likelihood function.
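One can watch this happen numerically: pin $\mu_1$ to $x_1$ and shrink $\sigma_1$. In the toy sketch below (my own code, not from any library), once $\sigma_1$ is tiny the $x_1$ term dominates and the log likelihood grows without bound, while the remaining terms stay bounded thanks to the second component:

```python
import numpy as np

def collapsed_loglik(x, alpha, mu2, sigma1, sigma2):
    """Log likelihood of a 2-component 1-D mixture with mu1 pinned at x[0]."""
    mu1 = x[0]  # the pathological configuration: a mean sitting exactly on a data point
    d1 = alpha * np.exp(-0.5 * ((x - mu1) / sigma1) ** 2) / (np.sqrt(2 * np.pi) * sigma1)
    d2 = (1 - alpha) * np.exp(-0.5 * ((x - mu2) / sigma2) ** 2) / (np.sqrt(2 * np.pi) * sigma2)
    # For points other than x[0], d1 -> 0 as sigma1 -> 0 (the exponential
    # decays faster than 1/sigma1 grows) and d2 keeps their terms bounded;
    # the x[0] term alone contributes ~ log(alpha / sigma1) -> infinity.
    return np.sum(np.log(d1 + d2))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 10)
# Shrinking sigma1 in this regime sends the log likelihood to infinity.
lls = [collapsed_loglik(x, 0.5, 0.0, s, 1.0) for s in (1e-6, 1e-9, 1e-12)]
```

Note that for moderate $\sigma_1$ the gain at $x_1$ can temporarily be outweighed by the loss at the other points; the divergence is only guaranteed in the limit $\sigma_1 \to 0$, which is exactly why an optimizer can slide into it.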

My Question: I have read about this singularity/collapsing-likelihood problem and have always had the following question: if this problem only occurs when one of the $x_i$ EXACTLY equals $\mu_i$, isn't the probability of this happening extremely small? If the probability is so small, why do people even talk about this? I suspect that there must be other ways in which this problem can manifest. Is it possible that other situations could trigger the exact same problem, e.g. $\mu_i$ being merely "close" to one or more of the $x_i$ points?

In general, can we characterize what other types of situations can trigger the collapse of the likelihood function in this case, and how likely these situations are to arise in practice?

Thanks!