In the problem of linear regression, we are given $n$ observations $\{ (x_1, y_1),\dots,(x_n, y_n)\}$, where each input $x_i$ is a $d$-dimensional vector. Our goal is to estimate a linear predictor $f(\cdot)$ that predicts $y$ given $x$ according to the formula \begin{equation} f(x) = x^\top\boldsymbol{\theta}. \end{equation} Let ${\bf y} = [y_1, y_2, \dots, y_n]^\top$ be the $n \times 1$ vector of outputs and ${\bf X} = [x_1, x_2, \dots, x_n]^\top$ the $n \times d$ matrix of inputs. One way to estimate the parameter $\boldsymbol{\theta}$ is to minimize the sum of squared errors. This is the least squares estimator: \begin{equation} \operatorname*{arg\,min}_{\boldsymbol{\theta}} \ \lVert {\bf y} - {\bf X}\boldsymbol{\theta} \rVert_2^2. \label{eq:LS_q3} \end{equation}
So the minimizer of this cost function is $\hat{\boldsymbol{\theta}}=(X^\top X)^{-1}X^\top y$, which requires the matrix $X^\top X$ to be invertible. What does this mean in terms of the data itself?
My thought was that it has something to do with the covariance matrix, but I'm not exactly sure.
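To make the question concrete, here is a small numpy sketch (mine, not part of the setup above) of the failure mode I have in mind, where one column of $X$ is a linear combination of the others:

```python
import numpy as np

rng = np.random.default_rng(0)

# A generic design matrix: 10 observations, 3 features.
# Random Gaussian columns are linearly independent almost surely.
X = rng.normal(size=(10, 3))
print(np.linalg.matrix_rank(X.T @ X))  # 3, so X^T X is invertible

# Append a column that is an exact linear combination of existing ones:
# rank stays at 3 even though d = 4, so X^T X is now singular.
X_bad = np.column_stack([X, X[:, 0] + 2 * X[:, 1]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 3 < 4
```

In the singular case `np.linalg.inv(X_bad.T @ X_bad)` raises `LinAlgError`, so the closed-form estimator above no longer exists.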
The matrix $X^{\top}X$ is invertible iff $\operatorname{rank}(X^{\top}X)=\operatorname{rank}(X)=d$, i.e., no column of $X$ is a linear combination of the other columns. To see what this means for the data, let $x_k$ denote the $k$-th column of $X$ and let $X_{-k}:=[x_1,\ldots,x_{k-1},x_{k+1},\ldots,x_d]$. Assume, for simplicity, that the sample average of each $x_k$ is $0$. Then, by the block-inverse formula, the $(k,k)$-th entry of $(X^{\top}X)^{-1}$ is given by \begin{align} [(X^{\top}X)^{-1}]_{k,k}&=\left[x_k^{\top}(I_{n}-X_{-k}(X_{-k}^{\top}X_{-k})^{-1}X_{-k}^{\top})x_k\right]^{-1} \\ &=\left[x_k^{\top}x_k\left(1-\frac{x_k^{\top}X_{-k}(X_{-k}^{\top}X_{-k})^{-1}X_{-k}^{\top}x_k}{x_k^{\top}x_k}\right)\right]^{-1} \\ &=\frac{1}{(1-R_k^2)(x_k^{\top}x_k)}, \end{align} where $R_k^2$ is the $R^2$ in the regression of $x_k$ on the columns of $X_{-k}$. In particular, $R_k^2=1$ exactly when $x_k=X_{-k}\beta$ for some $\beta\in \mathbb{R}^{d-1}$; as $R_k^2 \to 1$ this diagonal entry blows up, so the more nearly collinear $x_k$ is with the other columns, the closer $X^{\top}X$ is to being singular.
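A quick numerical check of the identity above (my own numpy sketch, with $n$, $d$, and $k$ chosen arbitrarily): the $(k,k)$ entry of $(X^\top X)^{-1}$ should equal $1/\big((1-R_k^2)\,x_k^\top x_k\big)$, with $R_k^2$ computed from the regression of $x_k$ on the remaining columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 4, 2

# Centered design, so the sample mean of each column is 0 as assumed above.
X = rng.normal(size=(n, d))
X -= X.mean(axis=0)

xk = X[:, k]
X_mk = np.delete(X, k, axis=1)  # X_{-k}: all columns except the k-th

# Regress x_k on X_{-k} (no intercept needed since everything is centered),
# then compute R_k^2 from the residual sum of squares.
beta, *_ = np.linalg.lstsq(X_mk, xk, rcond=None)
resid = xk - X_mk @ beta
R2 = 1 - (resid @ resid) / (xk @ xk)

lhs = np.linalg.inv(X.T @ X)[k, k]
rhs = 1 / ((1 - R2) * (xk @ xk))
print(np.isclose(lhs, rhs))  # True
```

Note that the middle expression in the derivation, $x_k^\top(I_n - X_{-k}(X_{-k}^\top X_{-k})^{-1}X_{-k}^\top)x_k$, is exactly `resid @ resid`: it is the squared norm of the residual after projecting $x_k$ onto the column space of $X_{-k}$.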