Why Is The Fisher Information Important?


I am struggling to understand the relationship between the Fisher Information and the Variance.

So far, what I understand:

  • Given a specific choice of Probability Distribution Function, the partial derivative (with respect to the parameter) of the Natural Logarithm of the corresponding Likelihood Function is called the Score Function
  • If we square the Score Function and take its Expected Value - this is the Fisher Information (note: when there are multiple parameters, the Fisher Information will be a Matrix)

Now, the important result from the above, is that apparently:

  • The (Negative) Inverse of The Fisher Information is equal to Variance

As an example, suppose you successfully evaluate the Fisher Information for all parameters (i.e. if the original Probability Distribution Function has "p" parameters, this will be a "p x p" Matrix). If you can somehow manage to take the Inverse of this Matrix, the diagonal components will contain the Variance Formulae for each of these parameters.
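To make the matrix-inversion step concrete, here is a small sketch assuming a Normal(mu, sigma^2) model (my example, not part of the question); the per-observation Fisher Information matrix used below is the standard textbook result for this model.

```python
import numpy as np

# A sketch assuming a Normal(mu, sigma^2) model (my example, not from the
# question). The per-observation Fisher Information matrix for the
# parameter vector (mu, sigma^2) is the standard result
#   I = [[1/sigma^2, 0], [0, 1/(2*sigma^4)]].
mu, sigma = 2.0, 1.5

I = np.array([[1 / sigma**2, 0.0],
              [0.0, 1 / (2 * sigma**4)]])

# Invert the matrix: the diagonal entries are the (asymptotic,
# per-observation) variances of the estimators of mu and sigma^2.
I_inv = np.linalg.inv(I)
print(I_inv[0, 0])  # mathematically sigma^2 = 2.25
print(I_inv[1, 1])  # mathematically 2*sigma^4 = 10.125
```

Here the matrix happens to be diagonal, so the inversion is trivial, but the same `np.linalg.inv` call applies to a general "p x p" information matrix.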

This seems to be a very important fact which is likely very useful in calculating the variance estimates for any probability distribution - but I am not sure why this is true. I tried to consult different references online (e.g. videos, university lecture notes), but I could not come across a source which demonstrated why this result is true.

Can someone please walk me through the math behind why the (Negative) Inverse of the Fisher Information is equal to the Variance? Is there a proof for this?

Thanks!


There are 3 answers below.

Answer 1

In my opinion, the Cramer-Rao Inequality (https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound) might be sufficient for proving the result in question. In your question, you outline the definition of a Likelihood Function and Fisher Information. Although it might not be a rigorous proof, it appears that this inequality shows the inverse relationship between Fisher Information and the Variance.

The only issue I can think of is that the Cramer-Rao Inequality provides only a lower bound on the Variance, rather than the actual Variance. I am not sure if this subtlety is of importance to you - but nonetheless, this could be a good start. Thus, you might want to look into a proof of the Cramer-Rao Inequality (of which there are plenty - e.g. https://gregorygundersen.com/blog/2019/11/27/proof-crlb/) and indirectly prove a version of this result.

Answer 2

Suppose you have a parametric statistical model given by a family of densities $\{f(\cdot, \theta) \mid \theta \in \Theta \subset \mathbb{R}^d \}$ and samples $(X_i)_{ 1 \le i \le n}$ from the density $f(\cdot,{\theta_0})$, where $\theta_0$ is unknown. The task is to use information from the data to estimate $\theta_0$.

A natural approach is that of maximum likelihood estimation, where our estimate is the value of $\theta$ which best explains the observed data. More precisely, for fixed observations $(X_i)_{1 \le i \le n}$ we define $\theta_{\text{MLE}} = \text{argmax}_{\theta \in \Theta}L(X_1,X_2, \cdots, X_n, \theta)$, where $L$ is the log likelihood function.

A natural question to ask is how much information the observed data carries about the true parameter. For fixed observations $(X_i)_{1 \le i \le n}$, think of the log likelihood as a function of $\theta$. If $L$ is flat and spread out, then many values of $\theta$ explain the data equally well, and so it is hard to distinguish between them. If $L$ is sharply peaked around $\theta_0$, then $\theta_0$ is easy to identify. Thus for fixed observations, the curvature of the log likelihood function tells us how much information the data carries about the true parameter. The curvature of the log likelihood is captured by the negative of its Hessian, with entries $H_{ij}(\theta) = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}$, and so the information the data carries about the parameter on average is given by $I_{ij}(\theta) = -\mathbb{E}_{\theta}\left[\frac{\partial^2 L(X,\theta)}{\partial \theta_i \partial \theta_j}\right]$, which is precisely the Fisher information.
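The curvature argument can be checked numerically. The sketch below assumes a Bernoulli(p0) model (my choice of example, not from the answer), whose per-observation Fisher information is $I(p) = 1/(p(1-p))$, and compares it with a finite-difference estimate of the negative second derivative of the log likelihood.

```python
import math
import random

# A sketch assuming a Bernoulli(p0) model (my example, not from the answer).
# Its per-observation Fisher information is I(p) = 1/(p*(1-p)); here we
# compare that with the curvature (negative second derivative) of the
# observed log likelihood, estimated by a central finite difference.
random.seed(0)
p0, n = 0.3, 200_000
xs = [1 if random.random() < p0 else 0 for _ in range(n)]
k = sum(xs)  # number of successes

def loglik(p):
    # log likelihood of the whole sample at parameter p
    return k * math.log(p) + (n - k) * math.log(1 - p)

h = 1e-4
curvature = -(loglik(p0 + h) - 2 * loglik(p0) + loglik(p0 - h)) / h**2

# Per-observation curvature should be close to the Fisher information.
print(curvature / n, 1 / (p0 * (1 - p0)))
```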

We have seen that the Fisher information tells us how much information the data carries about the true parameter. The more information, the more 'accurately' we can potentially estimate $\theta_0$. As you might expect, the information content of a model puts a ceiling on how well an unbiased estimator can do: the Cramér–Rao lower bound roughly says that an unbiased estimator must have variance at least as big as the inverse of the Fisher information.

Can this ceiling be attained? It can be shown that under appropriate regularity conditions, the MLE is consistent, meaning $\theta_{\text{MLE}} \underset{\mathbb{P}}{\to} \theta_{0}$ as $n \to \infty$, and even more strongly that it is asymptotically normal: $\sqrt{n}(\theta_{\text{MLE}} - \theta_{0}) \overset{d}{\to} N(0,I^{-1}(\theta_0))$. Thus the MLE is asymptotically efficient: it asymptotically attains the best possible variance for an unbiased estimator.
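The asymptotic normality claim can be illustrated by simulation. The sketch below again assumes a Bernoulli(p0) model (my example): the MLE is the sample mean, so the variance of $\sqrt{n}(\hat{\theta} - \theta_0)$ should be close to $1/I(p_0) = p_0(1-p_0)$.

```python
import math
import random
import statistics

# A sketch of asymptotic normality, assuming a Bernoulli(p0) model (my
# example): the MLE is the sample mean, and I(p) = 1/(p*(1-p)), so the
# variance of sqrt(n)*(mle - p0) should be close to 1/I(p0) = p0*(1-p0).
random.seed(1)
p0, n, reps = 0.3, 1_000, 2_000

scaled_errors = []
for _ in range(reps):
    mle = sum(1 if random.random() < p0 else 0 for _ in range(n)) / n
    scaled_errors.append(math.sqrt(n) * (mle - p0))

# Should be near p0*(1 - p0) = 0.21, i.e. the inverse Fisher information.
print(statistics.variance(scaled_errors))
```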

Answer 3

All your sentences with "the variance" do not make sense (variance of what? "variance of the parameters" does not make sense either since parameters are not random).

There are two main results relating the Fisher information with the variance of something. For simplicity I will state the univariate versions of the results, but there exist multivariate generalizations. The first result is a statement about any unbiased estimator, while the second result is about a specific estimator (the MLE).

  • Cramer-Rao lower bound. If $\hat{\theta}$ is an unbiased estimator of an unknown parameter $\theta$ based on $n$ independent observations, then $\text{Var}(\hat{\theta}) \ge \frac{1}{nI(\theta)}$, where $I(\theta)$ denotes the Fisher information of one observation.
  • Asymptotic distribution of the maximum likelihood estimator (MLE). Let $\hat{\theta}_{n}$ denote the MLE for an unknown parameter $\theta$ based on $n$ independent observations. Under certain regularity conditions, the MLE is asymptotically normal with variance attaining the Cramer-Rao lower bound. Specifically, $\sqrt{n}(\hat{\theta}_n - \theta) \overset{d}{\to} \mathcal{N}(0, \frac{1}{I(\theta)})$.
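Both bullet points can be illustrated at once with a model where the bound is attained exactly. The sketch below assumes a Normal(mu, sigma^2) model with known sigma (my example, not from the answer): the sample mean is an unbiased estimator of mu with $I(\mu) = 1/\sigma^2$, so its variance equals the Cramer-Rao bound $\frac{1}{nI(\mu)} = \sigma^2/n$.

```python
import random
import statistics

# A sketch of the Cramer-Rao bound, assuming a Normal(mu, sigma^2) model
# with known sigma (my example, not from the answer). The sample mean is
# unbiased with I(mu) = 1/sigma^2, and its variance attains the bound
# 1/(n*I(mu)) = sigma^2/n exactly.
random.seed(2)
mu, sigma, n, reps = 1.0, 2.0, 50, 5_000

means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

crlb = sigma**2 / n  # Cramer-Rao lower bound for an unbiased estimator of mu
var_hat = statistics.variance(means)
print(var_hat, crlb)
```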

The notes that you linked in your comment have a good overview of these two results.