Does maximum likelihood estimation really maximize probability?


In maximum likelihood estimation (for a continuous random variable), we choose the parameter of our density function so that, at the fixed observations, the density is greater than or equal to the density under any other parameter. I would like to say that if we maximize the likelihood function, the probability that our observations occur is maximized as well. But I can't really produce a proof of this (I can't convince myself that, by maximizing the density function, I maximize the probability around each sample point). If someone could explain this I would appreciate it.


Best answer:

In some sense, no. If the density functions $f_\theta$ are continuous, for any fixed $x\in\mathbb{R}^n$ we have: $$ \mathbb{P}_\theta\left[\bigcap_{i=1}^nX_i=x_i\right]\leq \mathbb{P}_\theta\left[X_1=x_1\right]=0 $$ In other words, the probability of observing any specific values $x_1,\ldots,x_n$ is always zero. If you want, the MLE still maximizes the probability that the observed values occur (which is zero), but so does any other parameter estimate.

If your random variable only takes finitely (or countably) many values, then: $$ \mathbb{P}_\theta[X_1=x_1]=f(x_1,\theta) $$ where $f(\cdot,\theta)$ is the probability mass function. So, in this case the MLE really gives the parameter $\theta$ which maximizes the probability of observing the data.
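To see the discrete case concretely, here is a small sketch. The coin-flip data and the grid search are illustrative assumptions, not part of the answer; the point is that the grid value maximizing the likelihood is exactly the observed frequency of heads, which is also the parameter maximizing the probability of the observed sequence.

```python
# Hypothetical coin-flip data: 7 heads out of 10 tosses, X_i ~ Bernoulli(p).
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
n, k = len(data), sum(data)

def likelihood(p):
    """Probability of observing exactly this sequence under parameter p.
    In the discrete case this IS the likelihood function."""
    return p**k * (1 - p)**(n - k)

# Scan a grid of candidate parameters; the maximizer is k/n = 0.7,
# the familiar Bernoulli MLE.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # -> 0.7
```

Because the likelihood here is literally the probability of the data, maximizing one is maximizing the other, with no limiting argument needed.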


Edit: Here's another way to look at the MLE, if the distribution of the $X_i$ is continuous. We could choose the parameter $\hat{\theta}_h$ which maximizes the probability of observing values in small intervals $(x_i-h,x_i+h)$ around our observations, for $h>0$: $$ \hat{\theta}_h:=\arg\max_{\theta\in\Theta}\prod_{i=1}^n\mathbb{P}_{\theta}\left[ X_i\in(x_i-h,x_i+h) \right] =\arg\max_{\theta\in\Theta}\prod_{i=1}^n\frac{\mathbb{P}_{\theta}[ X_i\in(x_i-h,x_i+h)]}{2h} $$ (dividing by the constant $2h$ does not change the maximizer). Here I assumed for simplicity that the $X_i$ are independent. For any $i$, we can write the $i$-th factor as: $$ \frac{\mathbb{P}_{\theta}[ X_i\leq x_i+h]-\mathbb{P}_{\theta}[ X_i\leq x_i-h]}{2h} $$ $$ =\frac{\mathbb{P}_{\theta}[ X_i\leq x_i+h]-\mathbb{P}_{\theta}[ X_i\leq x_i]}{2h} +\frac{\mathbb{P}_{\theta}[ X_i\leq x_i]-\mathbb{P}_{\theta}[ X_i\leq x_i-h]}{2h} $$ As $h\rightarrow 0$, each difference quotient converges to half the density (since the density is the derivative of the CDF), so the factor tends to $$ \frac{f(x_i,\theta)}{2}+\frac{f(x_i,\theta)}{2}=f(x_i,\theta) $$ So, we can expect that $\hat{\theta}_h\rightarrow\hat{\theta}_{MLE}$ (this last step can be made rigorous, but you may need some regularity assumptions).

In words: In the continuous case, you could define an estimator which maximizes the probability that your sample lies in small intervals around the observations. The MLE is the limiting case of this estimator, as the width of these intervals shrinks to zero.
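The limiting argument above can be checked numerically. A sketch, under illustrative assumptions: the sample values are made up, the model is $N(\theta,1)$ (whose MLE for $\theta$ is the sample mean), and the interval-probability estimator $\hat{\theta}_h$ is found by a simple grid search.

```python
import math

def norm_cdf(x, theta):
    # CDF of N(theta, 1), via the error function.
    return 0.5 * (1 + math.erf((x - theta) / math.sqrt(2)))

# Hypothetical sample, assumed drawn from N(theta, 1).
xs = [0.2, 1.1, -0.4, 0.9, 0.5]
mle = sum(xs) / len(xs)  # sample mean = MLE for theta

def log_ball_prob(theta, h):
    # log of prod_i P_theta[ X_i in (x_i - h, x_i + h) ]
    return sum(math.log(norm_cdf(x + h, theta) - norm_cdf(x - h, theta))
               for x in xs)

grid = [i / 1000 for i in range(-2000, 2001)]  # theta in [-2, 2]
for h in [1.0, 0.1, 0.01]:
    theta_h = max(grid, key=lambda t: log_ball_prob(t, h))
    # theta_h approaches the sample mean (the MLE) as h shrinks
    print(h, theta_h)
```

For each fixed $h>0$ the quantity being maximized is a genuine probability; as $h$ shrinks, the maximizer converges to the MLE, exactly as the limiting argument predicts.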

Another answer:

The likelihood function is defined as $P(X\mid\theta)$, where $X$ is your observation and $\theta$ is your parameter; here $P$ denotes the probability density function or probability mass function.

So in the discrete case, the likelihood is just the probability of getting the observation $X$ given $\theta$ (see the Wikipedia article on the likelihood function).

In the continuous case, it is the PDF of $X$ given $\theta$, which is the same as your "probability of getting a particular observation" only in a loose sense (since in the continuous case, the probability of getting any particular observation is always zero).

So maximizing the likelihood is, by definition, maximizing the probability of getting the particular observation: exactly in the discrete case, and only in the loose sense above in the continuous case.