This is regarding the answer by guy to the following question: Get a Fisher information matrix for linear model with the normal distribution for measurement error?
In the answer, guy states "if I observe data items I just add the individual Fisher information matrices". What I don't understand is why we are adding over the data items, when the Fisher information is typically derived from a single observation.
For example, if $X_1, X_2, \ldots, X_n$ are iid copies of some $X_\theta$, then we typically use $\log L(X_1; \theta)$ to find the FIM, instead of $\log L(X_1, X_2, \ldots, X_n; \theta)$, where $L(X_1, X_2, \ldots, X_n; \theta)$ is the likelihood of $X_1, X_2, \ldots, X_n$.
For example, let $X_1, X_2, \ldots, X_n$ be iid $\mathrm{Exp}(\lambda)$; then the FIM is $-E[l''(\lambda)]$, where $l(\lambda) = \log (\lambda e^{-\lambda x_1})$ is the log-likelihood of a single observation.
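A quick numerical sketch of this single-observation calculation (the values of `lam` and the grid of `x` below are arbitrary choices): since $l(\lambda) = \log\lambda - \lambda x$, we get $l''(\lambda) = -1/\lambda^2$ no matter what $x$ is, so the FIM is $1/\lambda^2$.

```python
import numpy as np

lam = 2.0
x = np.linspace(0.1, 10.0, 5)    # a few arbitrary data values

def l2(t, x, h=1e-4):
    """Central-difference second derivative of l(t) = log t - t*x."""
    ll = lambda s: np.log(s) - s * x
    return (ll(t + h) - 2 * ll(t) + ll(t - h)) / h ** 2

print(l2(lam, x))  # each entry close to -1/lam**2 = -0.25, independent of x
```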
But for the linear model, we are given $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ iid to some $(X, Y)$. The FIM is calculated from $I(\beta) = - \sum_{i=1}^{n} E[ - \frac{1}{\sigma ^2} X_i X_i ^ T]$. Notice that here we use all $n$ observations in the calculation, while in the above example we only use one observation.
Let $(\theta, x) \mapsto f_\theta(x)$ be a function with domain $\Theta \times \mathrm{S} \subset \mathbf{R}^q \times \mathbf{R}^p$ and values in $\mathbf{R}_+ = [0, \infty)$ such that $f_\theta(\cdot)$ is a density on $\mathbf{R}^p$ (i.e. it integrates to $1$ on $\mathrm{S},$ the "common support of the random variables") and, for each $x,$ $\theta \mapsto f_\theta(x)$ is smooth in some sense (e.g. it has two derivatives and integration and differentiation are exchangeable). The likelihood function (based on observed data $x$) is, by definition, the partial function $$ L:\Theta \to \mathbf{R}, \quad \theta \mapsto f_\theta(x). $$ The score function (based on observed data $x$) is, by definition, the derivative of the log-likelihood $$ s:\Theta \to \mathbf{R}^q, \quad \theta \mapsto \partial_\theta \log L = \dfrac{L'(\theta)}{L(\theta)} = \dfrac{\partial_\theta f_\theta(x)}{f_\theta(x)}. $$ Notice that we omit writing the dependency of both $L$ and $s$ on $x,$ as is commonly done in statistics, yet these are functions of $x;$ write them as $L(\theta; x)$ and $s(\theta; x).$ Since $x$ is a random outcome, both $L$ and $s$ are random. For $\theta \in \Theta,$ we define the (expected) Fisher information (based on observed data $x$), under the assumption that the "true model" is that of $\theta,$ as the variance (a.k.a. dispersion matrix) of the random vector $s(\theta)$ when we assume that the random variable $x$ has density $f_\theta(\cdot).$ Thus, $$ \mathbf{I}(\theta) := \mathbf{V}_\theta(s(\theta)) := \int\limits_\mathrm{S} (s(\theta; u) - \mu_\theta)(s(\theta; u) - \mu_\theta)^\intercal f_\theta(u)\, du, $$ where $\mu_\theta$ is the expected value of $s(\theta)$ assuming the true model is $f_\theta(\cdot).$ (I wrote $u$ so that you don't confuse the dummy variable of integration with the observed data $x.$ Notice that this function does not depend on the observed data.)
Notice that $$ \mu_\theta = \int\limits_\mathrm{S} s(\theta; u) f_\theta(u) du = \int\limits_\mathrm{S} \partial_\theta f_\theta(u) du = \partial_\theta \int\limits_\mathrm{S} f_\theta(u) du = \partial_\theta 1 = 0, $$ by the assumed smoothness. So $$ \mathbf{I}(\theta) = \int\limits_\mathrm{S} s(\theta; u) s(\theta; u)^\intercal f_\theta(u) du. $$
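As a quick numerical sanity check (my own sketch, not part of the derivation), take the one-dimensional family $\mathsf{Norm}(\theta, 1),$ for which $s(\theta; x) = x - \theta:$ under $f_\theta$ the score should average to $0$ and have second moment equal to $\mathbf{I}(\theta) = 1.$

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.7
u = rng.normal(theta, 1.0, size=500_000)  # draws from f_theta

score = u - theta                 # score of Norm(theta, 1) at the true theta
print(score.mean())               # close to 0: mu_theta = 0
print((score ** 2).mean())        # close to 1: the Fisher information I(theta)
```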
So far, so good. We can now assume that this setting models the repetition of many independent and identically distributed random outcomes. That is, we can take $p = nd$ and assume we have a family of smooth densities $g_\theta(\cdot)$ on $\mathbf{R}^d.$ If $x_i \in \mathbf{R}^d$ ($1 \leq i \leq n$) are independent, identically distributed random outcomes from this family $(g_\theta(\cdot)),$ we write $x^\intercal = (x_1^\intercal, \ldots, x_n^\intercal)$ as the observed data $x,$ and therefore the likelihood based on this observed data is $$ L(\theta; x) = f_\theta(x) = \prod\limits_{i = 1}^n g_\theta(x_i). $$ The score function is therefore $$ s(\theta; x) = \sum_{i = 1}^n \partial_\theta \log g_\theta(x_i) = \sum_{i = 1}^n s(\theta; x_i). $$ Aha! The score function of an independent sample is the sum of the individual score functions; call these $s_i.$ Since we assume that the $x_i$ are independent under any of the $f_\theta,$ we have $\mathbf{V}_\theta(s(\theta)) = \sum\limits_{i = 1}^n \mathbf{V}_\theta(s_i),$ and since all the $x_i$ follow the same distribution (they are assumed to follow $g_\theta$ when we calculate $\mathbf{V}_\theta$), we have $$ \mathbf{I}(\theta) = n \mathbf{I}_1(\theta), $$ where $\mathbf{I}_1(\theta)$ is the information function of a single observation with density $g_\theta(\cdot).$ So the information of an i.i.d. sample is the size of the sample times the information of any one of the variables.
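The additivity $\mathbf{I}(\theta) = n\mathbf{I}_1(\theta)$ can be checked by simulation. Here is a sketch for the asker's exponential example, where $s(\lambda; x) = 1/\lambda - x$ and $\mathbf{I}_1(\lambda) = 1/\lambda^2$ (the values of `lam`, `n` and `reps` below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 5, 200_000

# reps independent samples of size n from Exp(lam); NumPy's scale is 1/lam
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
score = (1.0 / lam - x).sum(axis=1)   # sample score = sum of individual scores

fisher_mc = score.var()               # Monte Carlo estimate of I(lam)
fisher_theory = n / lam ** 2          # n * I_1(lam)
print(fisher_mc, fisher_theory)       # the two numbers should nearly agree
```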
What happens in linear regression? During ordinary linear regression, we assume the model $$ y = X\beta + \varepsilon, $$ where $X \in \mathsf{Mat}_{n \times p}$ is deterministic (non-random) and without error, $\beta \in \mathbf{R}^p$ is the "parameter" to be estimated, $\varepsilon \sim \mathsf{Norm}_n(0; \sigma^2 I_n)$ and $y \in \mathbf{R}^n$ is the observed data. Often one allows $X$ to be random, but then we work with the conditional distribution given $X$ (so that $X$ is again deterministic). Under any of these circumstances, we are really assuming that the observed data has distribution $y \sim \mathsf{Norm}_n(X\beta, \sigma^2 I_n).$ Notice that this implies that the individual observations $y_i \sim \mathsf{Norm}(x_i^\intercal \beta, \sigma^2)$ are independent, where $x_i^\intercal$ is the $i$th row of $X.$ Obviously, the $y_i$ are not identically distributed unless the rows of $X$ are all the same, which is not useful (and in fact makes the model break down quite badly). Thus, to calculate the information, we use the whole sample, and not just $n$ times the information function of a single observation.
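To make the contrast concrete, here is a small sketch (my own illustration, with a made-up design matrix `X`): the linear-model information is still a sum over observations, but of distinct per-row terms $x_i x_i^\intercal/\sigma^2,$ which adds up to $X^\intercal X/\sigma^2$ rather than $n$ times any single row's contribution.

```python
import numpy as np

sigma2 = 1.5
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                   # design matrix with distinct rows

# sum of per-observation information matrices x_i x_i^T / sigma^2
I_sum = sum(np.outer(xi, xi) for xi in X) / sigma2
# whole-sample form X^T X / sigma^2
I_full = X.T @ X / sigma2

print(np.allclose(I_sum, I_full))            # the two computations agree
```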