So, I'm looking at this paper and trying to understand where equation 5 comes from.
Looking at Wikipedia, I see that they would use $\mathbf{x} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y}$, which is equivalent to $\mathbf{x} = \mathbf{A}^{-1}\mathbf{y}$ when $\mathbf{A}$ is invertible. These equations give the same answer, so why bother with the more complex one? Is this a generalisation to higher order? If so, why? I've seen vague references to reducing dimensionality.
From the above equations I can see that the covariance matrix is introduced to weight the above and form equation 5 in the paper. But how does it end up in the middle? And while I'm at it, why the inverse of the covariance matrix?
(It's been a while since I used matrices and linear algebra)
Thanks
You should read the text surrounding equation (6):
In particular, we generally have that $A$ is $m \times n$ with $m > n$. In general, it might be impossible to solve $y = Ax$, depending on your choice of $y$. If, however, we consider the least squares problem, we'll always have a solution (that is, at least one solution). Moreover (assuming the columns of $A$ are linearly independent), we'll always have a formula for the solution given a vector $y$, which is even better.
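To make this concrete, here is a small NumPy sketch. It checks that the normal-equations formula $(A^TA)^{-1}A^Ty$ matches NumPy's least-squares solver on an overdetermined system, and also sketches the weighted form $(A^TC^{-1}A)^{-1}A^TC^{-1}y$, which is my guess at what "equation 5" in the paper looks like: the covariance $C$ of the noise in $y$ ends up "in the middle" because each residual is weighted by the inverse of its uncertainty. The matrices and covariance here are made up for illustration.

```python
import numpy as np

# Overdetermined system: m = 4 equations, n = 2 unknowns,
# so A is not square and A^{-1} does not exist.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))
y = rng.normal(size=4)

# Ordinary least squares via the normal equations:
#   x = (A^T A)^{-1} A^T y
# (solve is preferred over explicitly forming the inverse).
x_ols = np.linalg.solve(A.T @ A, A.T @ y)

# NumPy's built-in least-squares routine gives the same answer.
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x_ols, x_ref)

# Weighted least squares (presumably the paper's equation 5):
# if C is the noise covariance of y, weight by C^{-1}:
#   x = (A^T C^{-1} A)^{-1} A^T C^{-1} y
C = np.diag([1.0, 2.0, 0.5, 1.5])   # hypothetical noise covariance
Cinv = np.linalg.inv(C)
x_wls = np.linalg.solve(A.T @ Cinv @ A, A.T @ Cinv @ y)
```

Note the role of $C^{-1}$: measurements with large variance get *small* weight, which is why it is the inverse of the covariance, not the covariance itself, that appears.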