The explanation below is long-winded; if you already know about using the pseudo-inverse to find the best-fit solution to a set of simultaneous equations, please skip down to the tl;dr.
The Problem
Given a data set $D$ consisting of $n$ data points of the form $(x_i,t_i)$, we wish to find a model for this system. If we assume that the output of the system $t_i$ is a linear combination of basis functions of $x_i$, then the coefficients of this linear combination can be collected into a parameter vector $\vec w_{p \times 1}=[w_1,w_2,...,w_p]^T$:
$t_i = w_1\phi_1(x_i)+w_2\phi_2(x_i)+...+w_p\phi_p(x_i)$
Similarly, the basis matrix $\Phi_{n \times p}$ is defined with its $i$th row equal to $[\phi_1(x_i),\phi_2(x_i),...,\phi_p(x_i)]$, for $n$ such rows, where $\phi_j(x)$ is the $j$th basis function. (Here we usually choose $\phi_1(x) = 1$ because we need a bias term in the model.)
So we have $p$ parameters for this model and we need to find $\vec w$ from $n$ data points.
$\vec t_{n \times 1} = [t_1,t_2,...,t_n]^T$ is the empirical data from the system, while $\Phi \vec w$ is the model output.
We can define the error vector as $\vec e_{n \times 1} = \vec t-\Phi \vec w$, and the mean square error over the whole data set $D$ is $\Large E(\vec w) = \frac{\vec e^T \vec e}{n}$.
Note that $\vec w$ defines the model, and $E(\vec w)$ assigns each choice of $\vec w$ a scalar value that tells us how good that model is.
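For concreteness, here is a minimal numerical sketch of this setup, assuming a small polynomial basis $\phi_1(x)=1$, $\phi_2(x)=x$, $\phi_3(x)=x^2$ and synthetic data (both are illustrative choices, not anything specific to the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
x = rng.uniform(-1, 1, size=n)
t = 2.0 + 0.5 * x - 1.5 * x**2 + 0.1 * rng.standard_normal(n)  # empirical targets

# Basis matrix Phi (n x p): i-th row is [phi_1(x_i), phi_2(x_i), phi_3(x_i)]
Phi = np.column_stack([np.ones(n), x, x**2])

def mse(w):
    """E(w) = e^T e / n with e = t - Phi w."""
    e = t - Phi @ w
    return (e @ e) / n

print(mse(np.zeros(p)))                 # a poor model
print(mse(np.array([2.0, 0.5, -1.5])))  # close to the generating coefficients
```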
So the question is,
How do we find $\vec w$ such that $E$ is minimum?
The solution
The entire derivation is available on my blog, but in short the value of $\vec w$ is
$\Large \vec w = (\Phi^T\Phi)^{-1}\Phi^T\vec t$
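As a sanity check, this closed-form expression can be compared against a library least-squares routine; the basis and data below are again just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
t = 2.0 + 0.5 * x - 1.5 * x**2 + 0.1 * rng.standard_normal(20)
Phi = np.column_stack([np.ones_like(x), x, x**2])

w_closed = np.linalg.inv(Phi.T @ Phi) @ (Phi.T @ t)  # (Phi^T Phi)^{-1} Phi^T t
w_pinv   = np.linalg.pinv(Phi) @ t                   # pseudo-inverse form
w_lstsq  = np.linalg.lstsq(Phi, t, rcond=None)[0]    # library least squares

print(np.allclose(w_closed, w_pinv), np.allclose(w_closed, w_lstsq))  # True True
```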
Cool, so tl;dr
We tried to bring $\vec e$ to $\vec 0$, so the initial set of $n$ simultaneous equations in $p$ parameters is $\vec t-\Phi \vec w = \vec 0$, that is
$\Large \Phi \vec w=\vec t$ - (1)
Now note that if we apply the linear operator $\Phi^T$ to both sides of this equation, $\Phi^T (\Phi \vec w) = \Phi^T \vec t$,
we obtain a new set of simultaneous equations:
$\Large (\Phi^T \Phi) \vec w = \Phi^T \vec t$ - (2)
This new system has $p$ equations in the same $p$ parameters, so as long as the square matrix $\Phi^T \Phi$ is invertible we can solve for the parameters.
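For example, the following sketch (with illustrative data) solves the $p \times p$ system (2) directly and checks that no small perturbation of the resulting $\vec w$ gives a lower MSE on the original $n$ equations (1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
x = rng.uniform(-1, 1, size=n)
t = 1.0 - 2.0 * x + 0.3 * x**2 + 0.05 * rng.standard_normal(n)
Phi = np.column_stack([np.ones(n), x, x**2])

# Solve the p x p system (2): (Phi^T Phi) w = Phi^T t
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

def mse(w):
    e = t - Phi @ w
    return (e @ e) / n

# No small random perturbation of w should give a lower MSE on (1).
perturbed = [mse(w + 1e-3 * rng.standard_normal(p)) for _ in range(1000)]
print(mse(w) <= min(perturbed))  # True
```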
1) From what I know, this linear operator maps the $n \times 1$ vector $(\Phi \vec w)_{n \times 1}$ from $R^n$ to the $p \times 1$ vector $((\Phi^T \Phi) \vec w)_{p \times 1}$ in $R^p$. But is it also right to think of the linear operator as mapping the matrices themselves, i.e. changing the coefficients of the set of equations?
2) While this operator makes the equations solvable, I don't get how the solution to this new set of $p$ equations (2) just happens to be identical to the least-MSE solution of the original $n$ equations (1).
If I am missing anything here, please fill in. Thanks a lot.
Yes, in the case of the least mean square solution to $A\vec x = \vec b$, $A^T$ does have special significance.
When $A_{n \times m}$ operates on $\vec x_{m \times 1}$, we need to obtain $\vec b_{n \times 1}$. In the case $n>m$, we might have $\vec b$ lying outside $A$'s column space. In that case we project $\vec b$ onto $A$'s column space (i.e. $C(A)$) and obtain $\hat b$. The solution $\vec x_o$ with $A\vec x_o = \hat b$ is called the least-squares (LMS) solution.
We can find this projection by taking an arbitrary vector $\vec a$ in $C(A)$. The error vector between $\vec a$ and $\vec b$ is $\vec e = \vec b - \vec a$. We want to minimize this error, and that happens when the error vector is perpendicular to the column space.
To visualize this, let $\vec b = (1,1,1)$ and let the column space be the XY plane (the matrix $A$ would be $[1,0,0;\,0,1,0;\,0,0,0]$). Now for the various points in the XY plane, the error vector is the segment connecting $(1,1,1)$ to that point. It's obvious that the point directly below $(1,1,1)$ (if we take the Z axis to be up and down) gives us the shortest error vector; that point is $(1,1,0)$.
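Here is that picture as a quick numerical check, using exactly this $\vec b$ and this (rank-deficient) $A$; the pseudo-inverse based `lstsq` call is just one convenient way to get the projection:

```python
import numpy as np

A = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., 0.]])
b = np.array([1., 1., 1.])

# Least-squares solution x_o and the projection b_hat = A x_o
x_o, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_o

print(b_hat)      # [1. 1. 0.]  -- the point directly below (1,1,1)
print(b - b_hat)  # [0. 0. 1.]  -- the error vector, perpendicular to the XY plane
```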
So we need to choose $\vec a = A \vec x_o$ such that $ \vec e \perp C(A)$.
Note that for any matrix, the left null space and the column space are orthogonal complements. So every vector which is perpendicular to the column space of $A$ must lie in the left null space of $A$.
So we can write the orthogonality condition as $A^T\vec e = \vec 0$: $$A^T(\vec b - \vec a) = A^T(\vec b - A\vec x_o) = \vec 0$$
As we can see, this is the application of the $A^T$ operator mentioned in the question, and it operates on the original $A\vec x = \vec b$ problem. Expanding it gives exactly the normal equations $A^TA\vec x_o = A^T\vec b$, which is why their solution coincides with the least-squares solution.
Edit: so if $\vec b$ has components perpendicular to $\operatorname{range}(A)$, operating with $A^T$ sends those components to $\vec 0$. So $A^T\vec b = A^T\hat b$, i.e. $A^T\vec b$ depends only on the part of $\vec b$ that lies in the range of $A$.
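As a final sanity check, here is a small numerical sketch on a random tall matrix (an illustrative choice) verifying both statements: the residual at the least-squares solution satisfies $A^T\vec e = \vec 0$, and $A^T\vec b = A^T\hat b$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))          # n > m, so b generally lies outside C(A)
b = rng.standard_normal(6)

x_o = np.linalg.solve(A.T @ A, A.T @ b)  # normal equations
b_hat = A @ x_o                          # projection of b onto C(A)
e = b - b_hat                            # error vector

print(np.allclose(A.T @ e, 0))            # True: e is perpendicular to C(A)
print(np.allclose(A.T @ b, A.T @ b_hat))  # True: A^T only "sees" the part of b in C(A)
```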