I was going through the derivation of linear regression and wanted to verify that my understanding is correct.
We minimize $$ {\lVert y - Xw \rVert}^2 $$
where $y$ is an $n \times 1$ vector, $X$ is our data matrix of size $n \times d$, and $w$ is a $d \times 1$ vector.
Note: Both $y$ and $X$ come from the training set; that is, we are learning $ w_{ls} = {(X^TX)}^{-1} X^Ty $.
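As a sanity check, the closed-form solution above can be computed directly and compared against NumPy's least-squares solver (the data here is random and purely illustrative):

```python
import numpy as np

# Illustrative data: n = 50 samples, d = 3 features
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))          # data matrix, n x d
y = rng.normal(size=n)               # labels, n x 1

# Closed-form least-squares solution: w_ls = (X^T X)^{-1} X^T y
w_ls = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's built-in least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_ls, w_lstsq)
```

In practice one would use `np.linalg.lstsq` (or a QR/SVD-based solver) rather than forming $(X^TX)^{-1}$ explicitly, which is numerically less stable.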
Now we can reach every possible $y$ by some linear combination of the columns of $X$ if $X$ has $n$ linearly independent columns (i.e. rank $n$).
So we would always be able to find a weight vector $w$ that makes the total error zero when this condition holds (is this a correct deduction?)
In matrix form, suppose we have 4 data points (so $n = d = 4$):
$ \implies \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \\ x_{31} & x_{32} & x_{33} & x_{34} \\ x_{41} & x_{42} & x_{43} & x_{44} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \end{bmatrix} $
We see that $y$ is a vector in $R^n$. Since $X$ has $n$ linearly independent columns, we can represent every point in $R^n$ as a combination of the columns of $X$. For example:
$ \implies \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = w_1 \begin{bmatrix} x_{11} \\ x_{21} \\ x_{31} \\ x_{41} \end{bmatrix} + w_2 \begin{bmatrix} x_{12} \\ x_{22} \\ x_{32} \\ x_{42} \end{bmatrix} + w_3 \begin{bmatrix} x_{13} \\ x_{23} \\ x_{33} \\ x_{43} \end{bmatrix} + w_4 \begin{bmatrix} x_{14} \\ x_{24} \\ x_{34} \\ x_{44} \end{bmatrix} $
Now our training labels for $n$ data points can be represented as a single vector in $R^n$, and any value of ${[y_1,y_2,y_3,y_4]}^T$ is a point in this space.
Since every point is reachable by a linear combination of the columns, any value of ${[y_1,y_2,y_3,y_4]}^T$ must also be reachable, and hence the error should be zero.
This essentially boils down to: we can always make the training error zero for a particular ${[y_1,y_2,y_3,y_4]}^T$ by selecting the right weights.
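The claim above can be checked numerically: with $n = d = 4$ and linearly independent columns, $X$ is invertible, so any $y$ is exactly reachable and the training error is zero (the random matrix here is illustrative; it is full rank with probability 1):

```python
import numpy as np

# With n = d = 4 and linearly independent columns, X is invertible,
# so any y is exactly reachable and the training error is zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 4))          # almost surely full rank
y = rng.normal(size=4)               # arbitrary labels

w = np.linalg.solve(X, y)            # here (X^T X)^{-1} X^T reduces to X^{-1}
residual = np.linalg.norm(y - X @ w)
assert residual < 1e-10              # training error is (numerically) zero
```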
This is only true if the entries of $y$ are exactly a linear function of the rows of $X$ (consider the case where there is some noise, in which case you won't be able to drive the error exactly to $0$).
I'm not sure what you mean by "reach some lower dimension".
You need to specify some assumptions on how $y, X$ are related. For example, assume that:
$$y_i = x_i^Tw + \epsilon$$
where $\epsilon \sim N(0, 1)$ is additive Gaussian noise. In this case, given $n>d$ samples, we can minimize the squared error by using the formula for $w$ that you've given above.
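A minimal sketch of this noisy setting (the ground-truth weights here are made up for illustration): with $n > d$ and additive noise, the least-squares fit recovers $w$ approximately, but the training error is no longer zero.

```python
import numpy as np

# Sketch of the noisy model y_i = x_i^T w + eps, eps ~ N(0, 1), with n > d
rng = np.random.default_rng(2)
n, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])  # illustrative ground-truth weights
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(size=n)  # additive Gaussian noise

# Least-squares estimate: close to w_true, but the residual is not zero
w_ls = np.linalg.inv(X.T @ X) @ X.T @ y
train_error = np.linalg.norm(y - X @ w_ls) ** 2
```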
In contrast to solving a consistent system of equations such as $\alpha = A\beta$ for $\beta$, where $A$ is an invertible $n\times n$ matrix, when solving for a least-squares solution our $X$ is usually not square (hence not invertible), so we cannot simply compute $w = X^{-1}y$. Instead, the formula you've given computes $w$ such that $Xw$ is the orthogonal projection of $y$ onto the range of $X$. This is what you want, since the orthogonal projection onto the range of $X$ is exactly the minimizer of $\|y-u\|_2^2$ subject to $u = Xw$.
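The projection view can be verified directly: the residual $y - Xw$ is orthogonal to every column of $X$, which is just the normal equations $X^T(y - Xw) = 0$ (again with illustrative random data):

```python
import numpy as np

# X w is the orthogonal projection of y onto the range (column space) of X,
# so the residual y - X w is orthogonal to every column of X.
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))         # tall matrix: not invertible
y = rng.normal(size=10)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w
# X^T (y - X w) = 0: the normal equations, i.e. orthogonality of the residual
assert np.allclose(X.T @ residual, 0, atol=1e-10)
```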