The meaning behind $(X^TX)^{-1}$


In linear algebra, we learn that the inverse of a matrix "undoes" the linear transformation. What exactly is the meaning of the inverse $(X^TX)^{-1}$?

We know $X^TX$ is a square matrix whose diagonal elements are the column sums of squares. So what are we doing when we take its inverse? I have always used this property in my calculations, but I would like to understand more of the meaning behind it.


There are 3 answers below.

---

Probably the main intuition comes from the fact that for the OLS model you have $$ \operatorname{Var}(\hat{\beta}) = \sigma^2_{\epsilon}(X'X)^{-1}, $$ so you can view $(X'X)^{-1}$ as a matrix that, in a sense, measures the stability of your model: the larger its entries, the more the fitted coefficients vary from sample to sample.
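A quick numerical sketch of this variance formula (my addition, not part of the answer; the sizes, seed, and coefficient values are arbitrary): simulate many noisy responses $y = X\beta + \varepsilon$ for a fixed design matrix and compare the empirical covariance of the OLS estimates against $\sigma^2(X'X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.5          # illustrative sizes and noise level
X = rng.normal(size=(n, p))        # fixed design matrix
beta = np.array([1.0, -2.0, 0.5])  # arbitrary "true" weights

XtX_inv = np.linalg.inv(X.T @ X)

# Refit OLS on many independent noise draws and collect the estimates.
estimates = []
for _ in range(5000):
    y = X @ beta + sigma * rng.normal(size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

# Empirical covariance of beta-hat vs. the theoretical sigma^2 (X'X)^{-1}.
emp_cov = np.cov(np.array(estimates), rowvar=False)
print(np.max(np.abs(emp_cov - sigma**2 * XtX_inv)))  # small
```

The two matrices agree up to Monte Carlo error, so a large entry of $(X'X)^{-1}$ really does translate into a volatile coefficient estimate.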

---

When $X$ is a real matrix, the elements of $(X^TX)^{-1}$ also provide a measure of the extent of linear dependence among the columns of $X$.

If $X^TX$ is invertible then the columns of $X$ have to be linearly independent, but sometimes the columns are "almost" dependent in a sense which will be made clear below.

Denote the $i$th column of $X$ by $x_i$ and let $\hat{x_i}$ denote the projection of $x_i$ on the space spanned by $\{x_j : j \neq i \}$. Call $\epsilon_i = x_i - \hat{x_i}$. Note that if any $\|\epsilon_i\|$ is "small", it indicates strong linear dependence among the columns of $X$.

One can prove that the $(i,j)$th element of $(X^TX)^{-1}$ is $\dfrac{\epsilon_i^T\epsilon_j}{\|\epsilon_i\|^2\|\epsilon_j\|^2}$.

In particular, the $i$th diagonal element of $(X^TX)^{-1}$ is $\dfrac{1}{\|\epsilon_i\|^2}$. So if the $i$th column of $X$ is almost a linear combination of the other columns, this is indicated by a very large value at the $i$th diagonal element of $(X^TX)^{-1}$.
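A sanity check of this diagonal identity (my own sketch, with an arbitrary random $X$): regress each column of $X$ on all the others, take the residual $\epsilon_i$, and compare $1/\|\epsilon_i\|^2$ with the corresponding diagonal entry of $(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))       # illustrative random design
G_inv = np.linalg.inv(X.T @ X)

for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    # Projection of x_i onto the span of the other columns via least squares.
    coef, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
    eps = X[:, i] - others @ coef  # the residual epsilon_i
    assert np.isclose(G_inv[i, i], 1.0 / (eps @ eps))
```

Shrinking one residual (i.e. making a column nearly redundant) blows up the matching diagonal entry, which is exactly the multicollinearity diagnostic described above.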

---Added later---

We can prove the expression for the elements of the inverse as follows.

Assume we have $p$ independent columns $x_1,\dots,x_p$. Let $\hat{x_i}$ be the projection of $x_i$ on the space spanned by $\{x_j : j \neq i\}$ and let $e_i = x_i - \hat{x_i}$.

By the definition of orthogonal projection, $e_i$ is orthogonal to any vector in the span of $\{x_j : j \neq i\}$. So $e_i^T x_j = 0$ for $j \neq i$.

Since $\hat{x_i} = x_i - e_i$ is in the space spanned by $\{x_j : j \neq i\}$, we get $e_i^T (x_i - e_i) = 0$, i.e., $e_i^T x_i = \|e_i\|^2$.

$e_1,\dots,e_p$ are independent because $\sum_{i=1}^p a_i e_i = 0$ implies, for any $j$, $(\sum_{i=1}^p a_i e_i )^T x_j = 0$, i.e., $a_j e_j^T x_j = a_j \|e_j\|^2 = 0$ so $a_j = 0.$

Since $\text{span}\{e_1,\dots,e_p\} \subset \text{span}\{x_1,\dots,x_p\}$ the independence of $e_i$s implies $\{e_i\}$ and $\{x_i\}$ are different bases of the same space.

There exist $p \times p$ matrices $A=(a_{ij})$ and $B=(b_{ij})$ such that $$ x_i = \sum_{k=1}^p a_{ik} e_k $$ and $$ e_i = \sum_{k=1}^p b_{ik} x_k$$ By the change of basis formula, $A$ and $B$ must be inverses of each other.

Note $x_i^T x_j = ( \sum_{k=1}^p a_{ik} e_k )^T x_j = a_{ij} e_j^T x_j = a_{ij} \|e_j\|^2$.

So $a_{ij} = \dfrac{x_i^T x_j}{\|e_j\|^2}$ for all $i,j$, i.e., $A = (X^T X) \begin{pmatrix} \dfrac{1}{\|e_1\|^2} & 0 & \dots & 0 \\ 0 & \dfrac{1}{\|e_2\|^2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \dfrac{1}{\|e_p\|^2}\end{pmatrix}$.

Similarly, we can prove $b_{ij} = \dfrac{e_i^T e_j}{\|e_j\|^2}$.

So $B = (E^T E) \begin{pmatrix} \dfrac{1}{\|e_1\|^2} & 0 & \dots & 0 \\ 0 & \dfrac{1}{\|e_2\|^2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \dfrac{1}{\|e_p\|^2}\end{pmatrix}$, where $E$ is the matrix with columns $e_1,\dots,e_p$.

Since $B = A^{-1}$ we have,

$$ (X^T X)^{-1} = \begin{pmatrix} \dfrac{1}{\|e_1\|^2} & 0 & \dots & 0 \\ 0 & \dfrac{1}{\|e_2\|^2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \dfrac{1}{\|e_p\|^2}\end{pmatrix} E^T E \begin{pmatrix} \dfrac{1}{\|e_1\|^2} & 0 & \dots & 0 \\ 0 & \dfrac{1}{\|e_2\|^2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \dfrac{1}{\|e_p\|^2}\end{pmatrix}$$ and the result follows.
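The whole factorization can be verified numerically (an illustrative sketch of mine, with a small random $X$): build $E$ column by column from least-squares residuals, form the diagonal matrix of $1/\|e_i\|^2$ values, and check the product against $(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))       # illustrative random design
p = X.shape[1]

# Build E column by column: e_i = x_i minus its projection onto the others.
E = np.empty_like(X)
for i in range(p):
    others = np.delete(X, i, axis=1)
    coef, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
    E[:, i] = X[:, i] - others @ coef

# Diagonal matrix with 1 / ||e_i||^2 on the diagonal, as in the proof.
D = np.diag(1.0 / np.sum(E * E, axis=0))

# The claimed identity: (X'X)^{-1} = D E'E D.
assert np.allclose(np.linalg.inv(X.T @ X), D @ E.T @ E @ D)
```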

---

To motivate the projections in Arin's great answer in the context of OLS:

Let's say I'm in the usual setting for OLS. So I have an $n\times p$ matrix $X$ whose rows are the "inputs", and an unknown weight vector $\beta$ such that the output vector is $\vec{y}=X\beta+\vec{\varepsilon}$ with $\vec{\varepsilon}\sim N(\vec{0},\sigma^2 I)$. Let $\hat{\beta}$ denote the fitted weight vector, $\hat{y}=X\hat{\beta}$ the fitted outputs, and $\hat{\sigma}^2=\|\hat{y}-\vec{y}\|^2/(n-p)$ the residual variance estimate.

As $\hat{\beta}_j-\beta_j=\vec{e_j}^t(X^tX)^{-1}X^t(\vec{y}-X\beta)$ and $\|X(X^tX)^{-1}\vec{e_j}\|^2=(X^tX)_{jj}^{-1}$, we have $\frac{\hat{\beta}_j-\beta_j}{\sigma}\sim N(0,(X^tX)_{jj}^{-1})$, and as this is independent of $\hat{\sigma}^2$, $\frac{(\hat{\beta}_j-\beta_j)^2}{\hat{\sigma}^2(X^tX)^{-1}_{jj}}\sim F(1,n-p)$. This gives us our typical means of testing a hypothesis like $\beta_j=0$ (although we usually take a square root and make it a $t$-test).
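The identity $\|X(X^tX)^{-1}\vec{e_j}\|^2=(X^tX)^{-1}_{jj}$ used above is easy to confirm numerically (a small sketch of mine, not part of the answer; since $(X^tX)^{-1}$ is symmetric, its $j$th column is $(X^tX)^{-1}\vec{e_j}$):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 3))       # illustrative random design
G_inv = np.linalg.inv(X.T @ X)

for j in range(X.shape[1]):
    v = X @ G_inv[:, j]            # this is X (X'X)^{-1} e_j
    assert np.isclose(v @ v, G_inv[j, j])
```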

But there's a second approach, which will generalize better. Let $S$ denote the subspace spanned by the columns of $X$, let $S_j$ denote the subspace of $S$ spanned by all columns except column $j$, and let $S'_j$ be the one-dimensional orthogonal complement to $S_j$ in $S$.

By spherical symmetry of $N(\vec{0},\sigma^2I)$, we have that $\|{\rm proj}_{S'_j}(\vec{y}-X\beta)\|^2/\sigma^2 \sim \chi^2_1$. And since $\hat{y}=X\hat{\beta}\in S$ and $X\beta\in S$, we see that ${\rm proj}_{S'_j}(\vec{y}-X\beta)={\rm proj}_{S'_j}(X\hat{\beta}-X\beta)$ (and so is independent of $\hat{\sigma}^2$). Letting $\vec{c}_j$ denote the $j$th column of $X$, we can write this as $(\hat{\beta}_j-\beta_j)\,{\rm proj}_{S'_j}(\vec{c}_j)$. Therefore:

$\frac{(\hat{\beta}_j-\beta_j)^2\|{\rm proj}_{S'_j}(\vec{c}_j)\|^2}{\hat{\sigma}^2}\sim F(1,n-p)$. This affirms Arin's answer: by comparing the tests, we must have $(X^tX)^{-1}_{jj}=\frac{1}{\|{\rm proj}_{S'_j}(\vec{c}_j)\|^2}$.

As an aside, this generalizes nicely to testing multiple weights. Consider a subset $J$ of the features; let $S_J$ denote the span of the columns in $J$ and $S'_J$ the orthogonal complement of $S_J$ in $S$. Then:

$\frac{\|{\rm proj}_{S'_J}(\sum\limits_{j\in J}(\hat{\beta}_j-\beta_j)\vec{c}_j)\|^2/|J|}{\hat{\sigma}^2}\sim F(|J|,n-p)$

The general linear $F$ test, for example, tests the hypothesis that all $\beta_j=0$ except the bias term.
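To illustrate (my own sketch; the subset $J$, sizes, and seed are arbitrary, and $\beta_j=0$ for $j\in J$ is the hypothesized value), one can check numerically that the projection form of the $F$ statistic agrees with the classical reduced-vs-full SSE form of the general linear $F$ test:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)             # pure-noise response, so beta_J = 0 holds
J = [1, 3]                         # features under test (arbitrary choice)

# Full-model fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
sigma2_hat = np.sum((y - y_hat) ** 2) / (n - p)

# Projection form: with hypothesized beta_J = 0, the numerator is the
# squared norm of the projection of y_hat onto S'_J, i.e. y_hat minus its
# projection onto S_J (the span of the columns not in J).
X_J = np.delete(X, J, axis=1)
coef, *_ = np.linalg.lstsq(X_J, y_hat, rcond=None)
proj_norm2 = np.sum((y_hat - X_J @ coef) ** 2)
F_proj = (proj_norm2 / len(J)) / sigma2_hat

# Classical form: (SSE_reduced - SSE_full) / |J|, divided by sigma2_hat.
coef_r, *_ = np.linalg.lstsq(X_J, y, rcond=None)
sse_reduced = np.sum((y - X_J @ coef_r) ** 2)
sse_full = np.sum((y - y_hat) ** 2)
F_classic = ((sse_reduced - sse_full) / len(J)) / sigma2_hat

assert np.isclose(F_proj, F_classic)
```

The agreement is algebraic (a Pythagorean identity between the two decompositions), not a Monte Carlo coincidence, so it holds for any design matrix.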