Does $(X'X)^{-1}$ always exist?


I'm studying Machine Learning theory and I have a question about the Normal Equation. The Normal Equation is:

$\Theta = (X'X)^{-1}X'Y\tag 1$

I know that (in some cases) we can use this other equation:

$\Theta = X^{-1}Y\tag 2$

But the problem is that $X$ might not have an inverse, so it's not recommended to use $(2)$.

My question is: If $(2)$ is not usually used because $X$ might not have an inverse, does $X'X$ in $(1)$ always have an inverse?

Thank you, everyone!

There are 4 best solutions below

On BEST ANSWER

There are some points in your question that warrant discussion at a conceptual level: what are we trying to achieve, rather than how to do it?

We are in the context of an over-determined system: more equations than unknowns. The unknowns are the parameters or coefficients in the linear system: $\Theta=\begin{bmatrix}\theta_1,\theta_2,\dots,\theta_n\end{bmatrix}^\top,$ with which we want to relate the explanatory variables (features or regressors) in the columns of the model matrix $X$ to the dependent variable or outcome $Y$ as: $Y=X\Theta.$

The problem stems from the fact that these explanatory variables are typically measured many times, once for every subject or example. For instance, in a medical study, the age, weight, height, blood pressure and cholesterol (explanatory variables) may be measured in hundreds of patients (matrix $X$), and related to a dependent variable $Y$ (for example, the concentration of some biochemical marker for cancer in blood). Note that this is the opposite of an under-determined system, in which there are only a few rows of measurements.

The equation $(2)$ is therefore not an option: the matrix $X$ is rectangular and cannot be inverted. If it were invertible, we would be in the situation where we have as many observations as unknowns, the system would have an exact solution, and there would be no need to project.

Instead, this is what the linear algebra of the subspaces of $X$ looks like in an overdetermined problem with linearly independent columns of $X$:

[Figure: the four fundamental subspaces of $X$]

Notice how the rank of $X$ is going to coincide with the number of columns $n,$ and the left nullspace, where all our woes reside, will grow in dimensionality as the number of observations ($m$ rows in the dataset $X$) increases (dim of the left nullspace $= m - n,$ since the rank coincides with $n$):

[Figure: dimensions of the four subspaces as $m$ grows]

Since what we have are the observations $Y$ of the dependent variable living in $\mathbb R^m,$ but what we want is the vector $\hat \Theta$ that lives in the row space of $X,$ we have a problem: although the column space of $X$ can be inverted back to the row space, a vector $Y$ that does not lie strictly in the hyperplane spanned by $\text{Col}(X)$ has a component in the left nullspace $\text{Null}(X^\top),$ corresponding to the errors $\epsilon,$ which is mapped to zero and hence cannot be recovered by any inverse matrix.

Projecting is what we need to settle for in a real-life, noisy example: we project the vector $Y$ onto the column space of $X,$ an $m \times n$ matrix where $m \gg n.$ We look for the orthogonal projection of the outcome vector $Y$ onto the subspace spanned by the $n$ columns of $X,$ which form a hyperplane within $\mathbb R^m.$ The projected vector of $Y$ is typically denoted with a hat, $\hat Y.$

[Figure: orthogonal projection of $Y$ onto $\text{Col}(X)$]

This acknowledges that no linear combination of the columns of $X$ can produce exactly $Y.$ If the matrix were square and of full rank $m,$ there would be no need to project.

As pointed out multiple times by now, $X^\top X$ can only be inverted when the columns of $X$ are linearly independent, which is almost always the case in noisy, real-life data matrices. When this is the case, $(X^\top X)^{-1}X^\top$ is a good second-best attempt at an inverse: for instance, it produces the identity when multiplied on the right by $X,$ as in $(X^\top X)^{-1}X^\top X=I.$ It can easily be proven that it produces the coefficients of the orthogonal projection, i.e. the error term will be perpendicular to $\text{Col}(X).$ The coefficients are thus calculated as

$$\hat \Theta = \left(X^\top X \right)^{-1} X^\top Y$$
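As a small sketch of my own (NumPy with synthetic random data, not part of the original answer), the claimed perpendicularity of the error term to $\text{Col}(X)$ can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3                      # many more observations than unknowns
X = rng.normal(size=(m, n))        # model matrix with independent columns
Y = rng.normal(size=m)             # noisy outcome vector

# Normal-equation coefficients: Theta_hat = (X^T X)^{-1} X^T Y
Theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Projection of Y onto Col(X), and the residual (error) vector
Y_hat = X @ Theta_hat
residual = Y - Y_hat

# The residual is orthogonal to every column of X
print(np.allclose(X.T @ residual, 0))  # True
```

The same coefficients come out of `np.linalg.lstsq(X, Y, rcond=None)`, which solves the least-squares problem without forming the inverse explicitly.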

The singular value decomposition can be used beyond the case where $X$ has linearly independent columns to obtain the Moore–Penrose pseudoinverse $X^+.$ When there is collinearity (less than full column rank), we can still use the pseudoinverse $X^+= V\Sigma^+ U^\top$ to estimate the parameters as $\Theta =X^+ Y.$ This is flexible in that for any model matrix $X$ decomposed via SVD into $X=U\Sigma V^\top,$ we can find a pseudoinverse

$$X^+=V\Sigma^{+}U^\top.$$
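To illustrate the rank-deficient case (a sketch of my own in NumPy, with a deliberately duplicated column): $X^\top X$ is singular, yet the SVD-based pseudoinverse still produces a solution of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
x1 = rng.normal(size=m)
X = np.column_stack([x1, 2 * x1])   # second column is collinear with the first
Y = rng.normal(size=m)

# X^T X is singular: its rank is 1, not 2, so it cannot be inverted
print(np.linalg.matrix_rank(X.T @ X))  # 1

# np.linalg.pinv computes X^+ = V Sigma^+ U^T via the SVD
Theta = np.linalg.pinv(X) @ Y

# Theta still satisfies the normal equations X^T X Theta = X^T Y
print(np.allclose(X.T @ X @ Theta, X.T @ Y))  # True
```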

On

$X'X$ is NOT always invertible. Consider $X$ a row vector: then $X'X$ is a matrix with rank 1, hence singular whenever it is larger than $1\times 1$.

In fact, when the inverse exists, $(X'X)^{-1}X'$ is the Moore–Penrose pseudoinverse of $X$, a generalization of the inverse $X^{-1}$.
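To make the counterexample concrete (my own NumPy illustration, not from the answer):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])   # X is a single row: one observation, three features
G = X.T @ X                        # 3x3 Gram matrix, an outer product

# G has rank 1, so (X'X)^{-1} does not exist
print(np.linalg.matrix_rank(G))    # 1

# The Moore–Penrose pseudoinverse still exists and generalizes the inverse
X_plus = np.linalg.pinv(X)
print(X_plus.shape)                # (3, 1)
```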

On

Given a system of linear equations $Ax =b$, one typically finds the $x$ which solves the system by letting

$$x=A^{-1}b$$

However, in machine learning we typically want to find an approximate solution to $Ax=b$, not an exact one: the system is overdetermined, so an exact solution usually does not exist, and the approximate solution accounts for generalization. Now, the approximate solution of

$$Ax=b$$

is given by

$$A'A x = A'b$$

$$(A'A )^{-1}A'A x = (A'A )^{-1}A'b$$

$$ x = (A'A )^{-1}A'b$$

This somewhat inconsequential-looking multiplication of both sides of $Ax=b$ by $A'$ is the basis of least squares, which was discovered by Gauss: https://en.wikipedia.org/wiki/Least_squares

Although $X'X$ is NOT always invertible, for most practical purposes you may assume it is; this is what people typically do in machine learning.

Strang, Gilbert. "The Fundamental Theorem of Linear Algebra." The American Mathematical Monthly, vol. 100, no. 9, 1993, pp. 848–855.

https://www.uvm.edu/pdodds/teaching/courses/2009-01UVM-124/docs/strang1993a.pdf

On

As an engineer, you may be familiar with the Singular Value Decomposition (SVD).

Now decompose $X= U\Sigma V^T$ with $U\in\mathbb R^{N\times N}, V\in\mathbb R^{M\times M}$ orthogonal and $\Sigma=\big[\begin{smallmatrix}D & 0 \\0& 0\end{smallmatrix}\big]\in\mathbb R^{N\times M}$, where $D=\operatorname{diag}(\sigma_1,\ldots,\sigma_r)$ holds the nonzero singular values. Let's define $\Sigma^+ = \big[\begin{smallmatrix}D^{-1} & 0 \\0& 0\end{smallmatrix}\big]$, which is $M\times N$.

As we will see, a solution to the normal equation is then given by:

$$ \theta^* = X^+y \overset{\text{def}}{=}V\Sigma^+U^Ty $$

where $X^+$ is known as the Moore–Penrose pseudoinverse. Then, in the Euclidean norm, the following holds:

$$\begin{aligned} \|X\theta-y\|^2&= \|U\Sigma V^T \theta - y\|^2&\text{using the SVD}\\ &= \|\Sigma V^T\theta - U^T y \|^2 &\text{since $U$ is orthogonal}\\ &=\|\Sigma V^T \theta - (\Sigma\Sigma^+ +\Pi) U^Ty\|^2 &\text{where $\Pi:= I -\Sigma\Sigma^+$}\\ &= \|\Sigma(V^T\theta-\Sigma^+U^Ty) - \Pi U^T y \|^2 &\text{regrouping} \\&= \Big\|\big[\begin{smallmatrix} D & 0 \\ 0& 0 \end{smallmatrix}\big](V^T\theta-\Sigma^+U^Ty) - \big[\begin{smallmatrix} 0 & 0 \\ 0& I \end{smallmatrix}\big] U^T y\Big\|^2 \\&= \|\Sigma(V^T\theta-\Sigma^+U^Ty)\|^2 + \|\Pi U^T y \|^2 &\text{by Pythagoras, as the vectors are orthogonal} \end{aligned}$$

Here, the second term is independent of $\theta$ and the first term is minimal, in fact zero, if $$V^T \theta = \Sigma^+ U^T y \iff \theta = V\Sigma^+ U^T y = X^+ y.$$ Crucially, in the last step we see how the SVD decouples the problem into a solvable and an unsolvable part. In particular, this proves constructively that $X^+y$ satisfies the normal equation, since the normal equation is the first-order necessary condition for a minimum.
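This construction can be checked numerically. The following NumPy sketch (my own illustration with random data, not part of the answer) builds $\Sigma^+$ explicitly from the SVD and verifies that $V\Sigma^+U^T$ matches NumPy's pseudoinverse and satisfies the normal equation:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 4
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Full SVD: X = U Sigma V^T with U (N x N) and V (M x M) orthogonal
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Build Sigma^+ (M x N): invert the nonzero singular values, transpose the shape
Sigma_plus = np.zeros((M, N))
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)

X_plus = Vt.T @ Sigma_plus @ U.T          # Moore-Penrose pseudoinverse V Sigma^+ U^T
theta = X_plus @ y

# Matches NumPy's pinv, and theta satisfies the normal equation X^T X theta = X^T y
print(np.allclose(X_plus, np.linalg.pinv(X)))   # True
print(np.allclose(X.T @ X @ theta, X.T @ y))    # True
```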