I am a computer science researcher who has to learn some numerical linear algebra for my work. I have been struggling with the SVD and Moore-Penrose pseudoinverse as of late. I am trying to solve some problems to get more comfortable with what should probably be routine manipulations.
First of all, I have gone through similar questions on Stack Exchange, but I believe they were more general and are not equivalent to mine. I am working in the framework where $A^{\dagger} = V\Lambda^{\dagger}U^T$, i.e., I am using the SVD. The matrix $A$ is, of course, identified with $U\Lambda V^T$.
Problem
Consider the matrix equation $Ax=y$, where $A\in \Bbb R^{m\times n}$. The corresponding least squares problem is to find a least squares solution $x_{\text{LS}}$ that minimizes the Euclidean norm of the residual, i.e.,
$$\|Ax_{\text{LS}}-y\| = \min_{x \in \Bbb R^n} \|Ax-y\| = \min_{z \in \mbox{Ran}(A)}\|z-y\|$$
a) Show that $A^{\dagger}y$ is a least-squares solution and satisfies the normal equation $A^TAx=A^Ty$. Why is this solution special?
b) Show that $\ker(A^TA) = \ker(A)$.
c) Use the above results to deduce that $x \in \Bbb R^n$ is a least-squares solution if and only if it satisfies the normal equation.
Help on any or all of these parts is appreciated. I'd also appreciate links to relevant posts. Like I said, I've read similar questions but did not understand them as they were in a more general framework.
Edit: I have solved b). It does not depend on a), as I had initially thought, and it is pretty straightforward to solve; see e.g. here: Prove that for a real matrix $A$, $\ker(A) = \ker(A^TA)$
Edit: I realize that part a) might be more involved than I had expected... Assuming parts a) and b), can someone help me with part c)?
The answers to a) and b) follow from Why does $\operatorname{null}(A) = \operatorname{null}(A^TA)$, intuitively? and Why does SVD provide the least squares and least norm solution to $Ax=b$?, as pointed out in the comments.
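For completeness, the normal-equation part of a) can also be checked directly in the SVD framework of the question. This is only a sketch, under the usual assumptions (not spelled out in the question) that $U$ and $V$ are orthogonal, that $A$ has rank $r$, and that $\Lambda^{\dagger}$ is the $n\times m$ matrix obtained from $\Lambda^T$ by inverting the nonzero singular values:

$$A^TA\,A^{\dagger}y = V\Lambda^TU^T\,U\Lambda V^T\,V\Lambda^{\dagger}U^Ty = V\Lambda^T\Lambda\Lambda^{\dagger}U^Ty = V\Lambda^TU^Ty = A^Ty.$$

The third equality uses that $\Lambda\Lambda^{\dagger}$ is the diagonal projector onto the first $r$ coordinates, so $\Lambda^T\Lambda\Lambda^{\dagger}=\Lambda^T$. By part c), $A^{\dagger}y$ is therefore a least-squares solution. It is special because it lies in the span of the first $r$ columns of $V$, which is $\ker(A)^\perp$: every other least-squares solution has the form $A^{\dagger}y+k$ with $k\in\ker(A)$, and then $\|A^{\dagger}y+k\|^2=\|A^{\dagger}y\|^2+\|k\|^2$, so $A^{\dagger}y$ is the least-squares solution of minimal norm.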
Let $\tilde{f}$ be a fixed minimizer of $$Q(f)=\|Af-g\|_K^2,$$ where $$ \qquad A\in \mathbb{R}^{m\times n},\quad f\in \mathbb{R}^{n}=H,\quad g\in \mathbb{R}^{m}=K.$$
Let $h\in H$, and note that $$Q(f+h)=Q(f)+\langle Af-g,Ah\rangle_K+\langle Ah,Af-g\rangle_K+\|Ah\|_K^2, \tag{1}$$ where $\langle\cdot,\cdot\rangle_K$ denotes the inner product on $K$. In particular, $$Q(f+h)=Q(f),\qquad f\in H, \quad h\in \ker(A).$$
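Identity $(1)$ and the invariance of $Q$ under shifts in $\ker(A)$ can be sanity-checked numerically. The following is a minimal sketch in plain Python; the rank-one matrix, the vectors, and the helper names (`matvec`, `dot`, `add`, `sub`, `Q`) are illustrative choices and do not come from the original post.

```python
# Minimal numerical check of identity (1); all data below is a made-up example.

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def Q(A, f, g):
    """Squared residual norm ||Af - g||^2."""
    r = sub(matvec(A, f), g)
    return dot(r, r)

# Rank-one 3x2 matrix; its kernel is spanned by (1, -1).
A = [[1.0, 1.0],
     [1.0, 1.0],
     [0.0, 0.0]]
g = [1.0, 3.0, 5.0]
f = [0.5, -2.0]
h = [0.25, 1.5]

residual = sub(matvec(A, f), g)   # Af - g
Ah = matvec(A, h)

# Left- and right-hand sides of (1); the two inner products coincide
# because the inner product is real and symmetric.
lhs = Q(A, add(f, h), g)
rhs = Q(A, f, g) + 2 * dot(residual, Ah) + dot(Ah, Ah)
print(lhs, rhs)

# Shifting f by a kernel vector leaves Q unchanged.
k = [1.0, -1.0]
Q_shifted = Q(A, add(f, k), g)
Q_base = Q(A, f, g)
print(Q_shifted, Q_base)
```

The check is only illustrative: it confirms the algebra on one example and is not a substitute for the expansion of $\|A(f+h)-g\|_K^2$ that proves $(1)$.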
Since $\tilde{f}$ is a minimizer and the inner product is real and symmetric, $(1)$ gives $$Q(\tilde{f})\leq Q(\tilde{f}+h)=Q(\tilde{f})+2\langle A^T(A\tilde{f}-g),h\rangle_H+\|Ah\|_K^2,\qquad h\in H.$$
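Replacing $h$ by $th$ with $t\in\Bbb R$ and letting $t\to 0$ shows that the linear term must vanish for every $h\in H$, i.e., $\tilde{f}$ is a critical point of $Q$: $$\nabla Q(\tilde{f})=2A^T(A\tilde{f}-g)=0.$$ This is the normal equation $A^TA\tilde{f}=A^Tg$; equivalently, $A\tilde{f}-g\in \ker(A^T)=\operatorname{range}(A)^\perp$. Conversely, if some $f\in H$ satisfies $A^T(Af-g)=0$, then the two inner-product terms in $(1)$ vanish, so $Q(f+h)=Q(f)+\|Ah\|_K^2\geq Q(f)$ for all $h\in H$, and $f$ is a minimizer. This proves c).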
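To make the equivalence concrete, here is a small plain-Python sketch on a made-up rank-one example. It evaluates the normal-equation residual $A^T(Af-g)$ for two different least-squares solutions and for a non-solution. The helper names are hypothetical, and the shortcut $A^{\dagger}=A^T/\|A\|_F^2$ used to obtain the pseudoinverse solution is valid only because this particular $A$ has rank one.

```python
# Illustrative check: least-squares solutions <=> normal equation holds.

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def Q(A, f, g):
    """Squared residual norm ||Af - g||^2."""
    r = sub(matvec(A, f), g)
    return sum(a * a for a in r)

def normal_residual(A, f, g):
    """A^T (A f - g); the zero vector exactly when f solves the normal equation."""
    return matvec(transpose(A), sub(matvec(A, f), g))

# Rank-one example, so least-squares solutions are not unique.
A = [[1.0, 1.0],
     [1.0, 1.0],
     [0.0, 0.0]]
g = [1.0, 3.0, 5.0]

# Pseudoinverse solution via the rank-one shortcut A^+ = A^T / ||A||_F^2.
frob2 = sum(a * a for row in A for a in row)
f_pinv = [c / frob2 for c in matvec(transpose(A), g)]

f_other = [f_pinv[0] + 2.0, f_pinv[1] - 2.0]  # shifted by a kernel vector (2, -2)
f_bad = [0.0, 0.0]                            # not a least-squares solution

for name, f in [("f_pinv", f_pinv), ("f_other", f_other), ("f_bad", f_bad)]:
    print(name, f, normal_residual(A, f, g), Q(A, f, g))
```

Both `f_pinv` and `f_other` give a zero normal-equation residual and the same value of $Q$, while `f_bad` gives a nonzero residual and a larger $Q$. Of the two solutions, `f_pinv` has the smaller Euclidean norm, in line with the minimal-norm property of $A^{\dagger}g$.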
You can find more details, in a more general setting, in Theorem 1.1 of The Mathematics of Computerized Tomography. You can also find related results by searching for "\(f=A^+g\)" on SearchOnMath.