Maximum Likelihood Principle; Local vs. Global Maxima


In statements of the Maximum Likelihood Principle for parameter estimation (MLE), there is no mention of whether to choose a local or a global maximum. From the examples given in various textbooks and lecture notes (in my very limited reading so far), it seems that we should choose the global maximum of the likelihood function for inference. Is this correct?

The reason I am asking is that I am dealing with some data whose likelihood seems to have several maxima. The parameter space is three-dimensional, so I have no intuition about the situation. In this case how do I estimate the parameters properly - do I just look for the maximum in a small part of the parameter space? (The bounds could be established through guesses based on the data, for example.)
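(One practical approach when a likelihood is suspected to be multimodal is multi-start local optimization: run a local optimizer from many random initial points inside plausible bounds and keep the best result. A minimal sketch, where the three-parameter `neg_log_lik` below is a hypothetical stand-in for the actual negative log-likelihood:)

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical multimodal negative log-likelihood in 3 parameters,
# standing in for the real one; it has four equally good minima.
def neg_log_lik(theta):
    x, y, z = theta
    return (x**2 - 1)**2 + (y**2 - 2)**2 + z**2

rng = np.random.default_rng(0)
best = None
for _ in range(20):                       # multi-start: many random inits
    theta0 = rng.uniform(-3, 3, size=3)   # bounds from data-based guesses
    res = minimize(neg_log_lik, theta0, method="Nelder-Mead")
    if best is None or res.fun < best.fun:
        best = res                        # keep the best value found so far

print(best.x, best.fun)  # approximate global minimizer of -log L
```

The more starts you use, the better the chance that at least one lands in the basin of the global maximum; comparing the distinct local optima found across starts also tells you how multimodal the surface really is.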


Accepted answer

Many, but not all, of the likelihood functions we usually encounter have a strictly concave logarithm (i.e., they are log-concave). Consequently, they have a unique stationary point, and that point is the global maximum. This does not mean, however, that there cannot be cases where the likelihood has multiple local maxima. In MLE you always look for the global maximum. Keep in mind that MLE is not necessarily a good estimator for every problem; there are common and interesting cases where it produces an estimate with large error.
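(To see how multiple interior maxima can arise, here is a small illustration of my own, not part of the answer above: the log-likelihood of a two-component Gaussian mixture, as a function of a single location parameter, has two distinct local maxima, only one of which is global:)

```python
import numpy as np

# Hypothetical data: a 0.7 / 0.3 mixture of N(-3, 1) and N(3, 1).
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3, 1, 70), rng.normal(3, 1, 30)])

# Model: f(x; mu) = 0.7 * N(mu, 1) + 0.3 * N(-mu, 1); log-likelihood in mu.
def log_lik(mu):
    comp = (0.7 * np.exp(-0.5 * (data - mu) ** 2)
            + 0.3 * np.exp(-0.5 * (data + mu) ** 2))
    return np.sum(np.log(comp / np.sqrt(2 * np.pi)))

grid = np.linspace(-6.0, 6.0, 601)
vals = np.array([log_lik(m) for m in grid])

# Interior local maxima: grid points higher than both neighbours.
peaks = [grid[i] for i in range(1, len(grid) - 1)
         if vals[i] > vals[i - 1] and vals[i] > vals[i + 1]]
print(peaks)  # one peak near mu = -3, another near mu = +3
```

The peak near $\mu=-3$ aligns the heavy component with the larger cluster and is the global maximum; the peak near $\mu=+3$ is only a local one, which is exactly the situation where a local optimizer can get stuck.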

Answer

Maximum likelihood estimation is based on the principle that you want to maximize the likelihood function, i.e. the function that represents the likelihood of the data that are observed.

As you know, the likelihood function is defined as $$\mathcal{L}(\theta)=\prod\limits_{i=1}^n f(x_i; \theta).$$ This is the product of the pdf evaluated at each observed value of $X$. Maximizing this function therefore yields the value of $\theta$ under which the observed data are most probable (and, as they are in fact real data, that is what you want). Hence the value of $\theta$ that you find after maximizing is the one you will want to use as an estimator.

Therefore, what you are after is a global maximum, since all you want is for your likelihood function to be maximized; it can very well happen that this occurs at a point which is not a local maximum. Consider for example the uniform distribution on $[0, \theta]$, with pdf $\frac{1}{\theta}$ for $x$ between $0$ and $\theta$. As a function of $\theta$, the likelihood has no interior stationary point, so we look at the boundary of the admissible region to find the global maximum, and $x_{n:n}$ (the largest observed value of $X$) becomes our estimator. So this example shows that the maximum likelihood estimate of $\theta$ can occur at a point which is not a local maximum.
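(The uniform example is easy to check numerically. A quick sketch with simulated data and a hypothetical true $\theta = 5$; the likelihood is $\theta^{-n}$ for $\theta \geq x_{n:n}$ and $0$ otherwise, so its maximum sits exactly at the boundary:)

```python
import numpy as np

# Hypothetical sample from Uniform[0, 5].
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=20)

# L(theta) = theta^{-n} if theta >= max(x), else 0: no interior stationary point.
def likelihood(theta):
    return theta ** (-len(x)) if theta >= x.max() else 0.0

grid = np.linspace(0.1, 10.0, 1000)
vals = [likelihood(t) for t in grid]
theta_hat = grid[int(np.argmax(vals))]
print(theta_hat, x.max())  # maximizer lands at the boundary theta = x_{n:n}
```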

Answer

In general, working with MLE depends on the case under consideration.

Things with MLE can go exceptionally well (when the log-likelihood is concave, for example) or very badly.

As an example of the latter case, consider logistic regression in the presence of complete separation: one can prove that the MLE diverges. More precisely, the maximum of the likelihood function lies at the boundary of the $n$-cone generated by the parameters. An additional problem in this case is that statistical software will usually return a finite solution together with a warning/error message; in other words, you need to read the warning/error message to realize that the Newton-Raphson algorithm has diverged.
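(The divergence is easy to reproduce without any library: plain gradient ascent on the logistic log-likelihood of a completely separated sample — a hypothetical one-covariate, no-intercept example, not the exact setup above — pushes the coefficient off toward infinity rather than converging:)

```python
import numpy as np

# Completely separated data: y = 1 exactly when x > 0.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Score (gradient) of the logistic log-likelihood in a single coefficient beta.
def grad(beta):
    p = 1.0 / (1.0 + np.exp(-beta * x))   # logistic probabilities
    return np.sum((y - p) * x)

beta = 0.0
norms = []
for step in range(5000):
    beta += 0.5 * grad(beta)              # plain gradient ascent
    norms.append(abs(beta))

print(norms[0], norms[-1])  # |beta| keeps growing: the MLE does not exist
```

Under complete separation the score stays strictly positive for every finite `beta`, so the iterates grow without bound (roughly logarithmically in the step count); whatever finite number a solver reports is just where it gave up.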

Answer

The reason I am asking is because I am dealing with some data whose likelihood seems to have several maxima.

If you have several interior maxima, it might be worth documenting what you are doing, as that is an atypical case. In the previously mentioned degenerate cases of logistic regression the MLE runs off to infinity (or to the boundary, if you restrict the coefficients), but this is not really a breakdown of the MLE global-maximization paradigm; it arises from having multiple models that fit the data perfectly.