What are the properties of data we have to have in order to do least square regression with just matrix multiplication?


In the problem of linear regression, we are given $n$ observations $\{ (x_1, y_1),\dots,(x_n, y_n)\}$, where each input $x_i$ is a $d$-dimensional vector. Our goal is to estimate a linear predictor $f(\cdot)$ which predicts $y$ given $x$ according to the formula \begin{equation} f(x) = x^\top\boldsymbol{\theta}. \end{equation} Let ${\bf y} = [y_1, y_2, \dots, y_n]^\top$ be the $n \times 1$ vector of outputs and ${\bf X} = [x_1, x_2, \dots, x_n]^\top$ be the $n \times d$ matrix of inputs. One possible way to estimate the parameter $\boldsymbol{\theta}$ is through minimization of the sum of squared errors. This is the least squares estimator: \begin{equation} \operatorname{arg\,min}_{\boldsymbol{\theta}} \ \lVert {\bf y} - {\bf X}\boldsymbol{\theta} \rVert_2^2. \end{equation}

So the optimal value of $\boldsymbol{\theta}$ for this cost function is $\hat{\boldsymbol{\theta}}=({\bf X}^\top{\bf X})^{-1}{\bf X}^\top{\bf y}$, so we need the matrix ${\bf X}^\top{\bf X}$ to be invertible. What does this mean in terms of the data itself?
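The closed-form estimator above can be checked numerically. Here is a minimal sketch with synthetic data (the dimensions $n = 50$, $d = 3$ and the noise level are arbitrary choices for illustration), comparing the normal-equations solution against NumPy's built-in least squares solver:

```python
import numpy as np

# Synthetic data: n = 50 observations, d = 3 features.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))          # full column rank with probability 1
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

# Closed-form least squares: theta = (X^T X)^{-1} X^T y.
# This only works when X^T X is invertible, i.e. X has full column rank.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_hat, theta_lstsq))  # True
```

In practice one solves the linear system rather than explicitly inverting ${\bf X}^\top{\bf X}$, which is both cheaper and numerically more stable.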

My thought was it has to do with covariance matrix somehow, but I'm not exactly sure.

Best Answer

The matrix $X^{\top}X$ is invertible if and only if $\operatorname{rank}(X^{\top}X)=\operatorname{rank}(X)=d$, i.e. no column of $X$ is a linear combination of the other columns (in particular, this requires $n\geq d$). Specifically, let $x_k$ denote the $k$-th column of $X$ and let $X_{-k}:=[x_1,\ldots,x_{k-1},x_{k+1},\ldots,x_d]$. Assume, for simplicity, that the sample average of each $x_k$ is $0$. Then the $(k,k)$-th entry of $(X^{\top}X)^{-1}$ is given by \begin{align} [(X^{\top}X)^{-1}]_{k,k}&=\left[x_k^{\top}(I_{n}-X_{-k}(X_{-k}^{\top}X_{-k})^{-1}X_{-k}^{\top})x_k\right]^{-1} \\ &=\left[x_k^{\top}x_k\left(1-\frac{x_k^{\top}X_{-k}(X_{-k}^{\top}X_{-k})^{-1}X_{-k}^{\top}x_k}{x_k{^\top}x_k}\right)\right]^{-1} \\ &=\frac{1}{(1-R_k^2)(x_k^{\top}x_k)}, \end{align} where $R_k^2$ is the $R^2$ in the regression of $x_k$ on the columns of $X_{-k}$. In particular, $R_k^2=1$ when $x_k=X_{-k}\beta$ for some $\beta\in \mathbb{R}^{d-1}$. Regarding your intuition: with centered columns, $X^{\top}X/n$ is exactly the sample covariance matrix of the inputs, so invertibility of $X^{\top}X$ is equivalent to nonsingularity of the sample covariance.
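The rank condition and the covariance connection can both be verified numerically. In this sketch (the data and the coefficients in the linear combination are made up for illustration), duplicating the linear dependence makes the Gram matrix $X^{\top}X$ rank-deficient, and centering recovers the sample covariance:

```python
import numpy as np

# Construct a design matrix whose third column is an exact
# linear combination of the first two.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 - x2                           # x3 = X_{-3} @ beta, beta = (2, -1)

X_ok = np.column_stack([x1, x2])           # rank 2: X^T X invertible
X_bad = np.column_stack([x1, x2, x3])      # rank 2 < d = 3: X^T X singular

print(np.linalg.matrix_rank(X_ok.T @ X_ok))    # 2
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 2, not 3

# With centered columns, X^T X / n equals the (biased) sample
# covariance matrix, so singularity of X^T X is exactly
# singularity of the sample covariance.
Xc = X_bad - X_bad.mean(axis=0)
cov = np.cov(X_bad, rowvar=False, bias=True)
print(np.allclose(Xc.T @ Xc / n, cov))         # True
```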