Relationship Between a Parabola and the Normal Distribution

In this video (https://www.youtube.com/watch?v=m62I5_ow3O8), at 4:05 the author talks about taking the (2nd Order) Taylor Expansion of the Likelihood Function (in statistics) for some "model parameter". Naturally, the 2nd Order Taylor Expansion will be a "parabola-shaped function". Based on this 2nd Order Expansion, the author is interested in evaluating the "value of the model parameter" that brings the value of the (2nd Order Expanded) Likelihood Function to $0$.

The author states that instead of directly taking the 2nd Order Expansion of the Likelihood Function, it is more advantageous to take the 2nd Order Expansion of the (negative) Logarithm of the Likelihood Function (I didn't quite understand the reason behind this; why exactly is it advantageous?). However, the author states that the 2nd Order Expansion of the (negative) Logarithm of the Likelihood Function will no longer be shaped like a parabola, but rather will take the form of a Gaussian Distribution (i.e. an "inverted parabola").

In general terms, I am trying to understand the mathematical logic behind this. Suppose I have some arbitrary 2nd Order Function: $f(x) = x^2$. If I take the second derivative of the negative logarithm of this function, I get the following result:

Second derivative: $\frac{d^2}{dx^2}\left[-\log(x^2)\right] = \frac{2}{x^2}$
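
(As a quick symbolic sanity check of my algebra, here is a short SymPy snippet; it only verifies the derivative above, nothing more:)

```python
import sympy as sp

x = sp.symbols('x', positive=True)

# Second derivative of -log(x^2); with x > 0 this simplifies cleanly.
second_deriv = sp.diff(-sp.log(x**2), x, 2)
print(second_deriv)  # -> 2/x**2
```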

If I were to plot this function:

[Plot: the "Red Function" $\frac{2}{x^2}$]

I can see that the "Red Function" somewhat resembles the bell-shaped curve of a normal distribution.

My Question: Is my understanding of the above correct? And is there any reason as to why the negative logarithm of the 2nd order expansion is "more advantageous" compared to just the 2nd order expansion?

Thanks!

There are 2 answers below.

Answer 1:

I don't quite get what the author was trying to say about the approximation, but a few things come to mind.

  1. Let $f(x)$ be an arbitrary (smooth enough) function with a local maximum at $x = x_0$. Then, using the Taylor approximation near $x_0$, $$ f(x) \approx f(x_0) + f'(x_0)\cdot (x-x_0) + \frac{1}{2} f''(x_0) \cdot (x- x_0)^2 $$ Since $x_0$ is a point of local maximum, $f'(x_0) = 0$ and $f''(x_0) < 0$. It means that you approximate the function with a downward parabola $f(x_0) - \frac{1}{2} |f''(x_0)| \cdot (x - x_0)^2$. But it also means that $\exp\{f(x)\} \approx \exp\{f(x_0) - \frac{1}{2} |f''(x_0)| \cdot (x - x_0)^2\}$, which is a bell-shaped curve (see the first sketch after this list).

  2. When dealing with Maximum likelihood estimation (MLE) of some parameter $\lambda$, in the case of i.i.d. samples you get a likelihood function of the form $$\mathcal{L}(\lambda) = \prod_{k=1}^n f(x_k \mid \lambda)$$ where $f(x_k \mid \lambda)$ are the pdf's with parameter $\lambda$ evaluated at sample $x = x_k$. For example, if you have samples from $N(\lambda, 1)$ (normal distribution with known variance $1$ and unknown mean $\lambda$), $f(x \mid \lambda) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-\lambda)^2}{2}}$, so $$ \mathcal{L}(\lambda) = \prod_{k=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_k-\lambda)^2}{2}} = \frac{1}{(\sqrt{2\pi})^n} \cdot \exp\left\{-\sum_{k=1}^n \frac{(x_k-\lambda)^2}{2}\right\} $$ It's hard to find the maximum of such a function analytically, because even $\frac{d\mathcal{L}}{d\lambda}$ will be extremely complicated. However, using the fact that the logarithm is a monotone function, and hence $\ln \mathcal{L}(\lambda)$ attains its maximum at the same point as $\mathcal{L}(\lambda)$, we can find the argmax of $\ln \mathcal{L}(\lambda)$ instead: $$ \ln \mathcal{L}(\lambda) = \ln \left( \frac{1}{(\sqrt{2\pi})^n}\right) - \sum_{k=1}^n \frac{(x_k-\lambda)^2}{2} = -\frac{n}{2} \cdot \ln \left( 2\pi\right) - \sum_{k=1}^n \frac{(x_k-\lambda)^2}{2} \\ \frac{d \ln\mathcal{L}}{d\lambda} = -\sum_{k=1}^n \frac{d}{d\lambda} \left( \frac{(x_k-\lambda)^2}{2}\right) = \sum_{k=1}^n (x_k - \lambda) = -n \lambda + \sum_{k=1}^n x_k \\ \frac{d \ln\mathcal{L}}{d\lambda} = 0 \Rightarrow \lambda = \frac{1}{n} \sum_{k=1}^n x_k, \; \frac{d^2 \ln\mathcal{L}}{d\lambda^2} = -n < 0 $$ So the MLE of $\lambda$ is $\frac{1}{n} \sum_{k=1}^n x_k$. In this case the logarithm is used to simplify the calculation needed to find the maximum point of the likelihood function (see the second sketch below).
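
To make point 1 concrete, here is a minimal numerical sketch in Python (my own illustration, not from the video; the test function $f(x) = x - e^x$ is an arbitrary choice with a maximum at $x_0 = 0$). It compares $e^{f(x)}$ against the exponential of the quadratic Taylor approximation, which is a scaled Gaussian:

```python
import numpy as np

# Arbitrary smooth function with a local maximum at x0 = 0:
# f(x) = x - e^x, so f'(x) = 1 - e^x (zero at x = 0) and f''(x) = -e^x.
def f(x):
    return x - np.exp(x)

x0 = 0.0
f0 = f(x0)            # f(x0) = -1
f2 = -np.exp(x0)      # f''(x0) = -1 < 0 (local maximum)

xs = np.linspace(-1.5, 1.5, 7)
quadratic = f0 + 0.5 * f2 * (xs - x0) ** 2   # downward parabola
gaussian = np.exp(quadratic)                 # exp(parabola) = scaled bell curve

for xi, exact, approx in zip(xs, np.exp(f(xs)), gaussian):
    print(f"x = {xi:+.2f}   exp(f(x)) = {exact:.4f}   Gaussian approx = {approx:.4f}")
```

Near $x_0$ the two columns agree closely; the agreement degrades away from the maximum, as you would expect from a local approximation.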
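And here is a quick numerical check of point 2 (again just a sketch; the random seed, sample size $n = 500$, and true mean $2.0$ are arbitrary choices): a brute-force grid search over $\ln \mathcal{L}(\lambda)$ lands on the sample mean, matching the analytic MLE.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=500)   # i.i.d. draws from N(2, 1)

def log_likelihood(lam, x):
    # ln L(lambda) = -(n/2) ln(2 pi) - sum_k (x_k - lambda)^2 / 2
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - lam) ** 2)

# Brute-force grid search for the argmax of the log-likelihood.
grid = np.linspace(0.0, 4.0, 4001)
values = [log_likelihood(lam, samples) for lam in grid]
lam_hat = grid[np.argmax(values)]

print(f"grid-search argmax: {lam_hat:.4f}")
print(f"sample mean (MLE):  {samples.mean():.4f}")
```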

Answer 2:

I would agree with @Yalikesifulei's answer above. You have seen the graphs of the normal distribution and a rectangular hyperbola, but unless you do the actual mathematical calculations, you can't prove anything. Just saying they look similar doesn't prove any similarity.

For any probability density function, if you take its integral over its entire domain, the result should equal $1$. Let $Z \sim N(0, 1^2)$, with density $f_Z$.

Then: $\int_{-\infty}^{\infty} f_Z(x)\, dx = 1$

However: $\int_{-\infty}^{\infty} \frac{2}{x^2}\, dx = \infty$
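
As a small numerical illustration of these two integrals (a sketch using SciPy; the shrinking lower cutoffs are my own choice, picked to exhibit the divergence near $0$):

```python
import numpy as np
from scipy.integrate import quad

# The standard normal density integrates to 1 over the whole real line.
pdf = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
total, _ = quad(pdf, -np.inf, np.inf)
print(f"integral of the N(0,1) density: {total:.6f}")   # ~1.000000

# By contrast, int_eps^1 (2/x^2) dx = 2/eps - 2, which blows up as eps -> 0,
# so 2/x^2 cannot be normalized into a density on the real line.
for eps in (1e-1, 1e-3, 1e-5):
    val, _ = quad(lambda t: 2.0 / t**2, eps, 1.0)
    print(f"eps = {eps:.0e}: integral = {val:.1f}")
```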

The $2$ functions are nowhere near similar. You can try finding similarities, but you are not likely to succeed. Comparing their graphs is a good starting point, but it is not enough to prove any similarity mathematically.