Consider the linear regression model in the case where the number of training samples $n$ is smaller than the number of parameters $d$ (overparametrized linear regression). The model is
$$y_i = \langle\beta,\mathbf{x}_i\rangle +\epsilon_i, \quad \beta\in\mathbb{R}^d,\ \epsilon_i\in\mathbb{R},\ i=1,\dots,n$$
where the $\epsilon_i$ are zero-mean noise terms. Let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be the data matrix:
\begin{align} \mathbf{X} &= \begin{bmatrix} \mathbf{x}_1\\ \mathbf{x}_2\\ \vdots \\ \mathbf{x}_n \end{bmatrix} \end{align}
Now, since $n<d$, $\mathbf{X}^T\mathbf{X}\in\mathbb{R}^{d\times d}$ is not invertible, since $$\text{rank}(\mathbf{X})=\text{rank}(\mathbf{X}^T) \leq \min \{n,d\}=n\implies \text{rank}(\mathbf{X}^T\mathbf{X})\leq n<d$$
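To see this rank deficiency concretely, here is a small NumPy sketch (the Gaussian data is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10  # fewer samples than parameters
X = rng.standard_normal((n, d))

G = X.T @ X  # d x d Gram matrix
# rank(G) <= n = 5 < d = 10, so the d x d matrix G is singular
print(np.linalg.matrix_rank(G))
```

With Gaussian rows the rank equals $n$ almost surely, so `G` is a $10\times 10$ matrix of rank $5$.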
There are clearly infinitely many $\beta$ such that
$$\mathbf{y}=\mathbf{X}\beta$$
I'm told that the minimum norm solution is
$$\beta^*=\mathbf{X}^{\dagger}\mathbf{y}$$
where, in this specific case $n < d$, $\mathbf{X}^{\dagger}=\mathbf{X}^T(\mathbf{X} \mathbf{X}^T)^{-1}$. Can you help me prove this in a constructive manner? I tried to premultiply by $\mathbf{X}^T\in\mathbb{R}^{d\times n}$ (which is legit since $\mathbf{y}\in\mathbb{R}^n$), but I get
$$\mathbf{X}^T\mathbf{y}=\mathbf{X}^T\mathbf{X}\beta$$
and now I'm stuck because, as noted above, $\mathbf{X}^T\mathbf{X}$ is not invertible.
EDIT: the solution
$$\beta^*=\mathbf{X}^{\dagger}\mathbf{y}$$
is valid iff $\mathbf{X}\mathbf{X}^T$ is invertible. Now, $\mathbf{X}\mathbf{X}^T\in\mathbb{R}^{n\times n}$, so $\text{rank}(\mathbf{X}\mathbf{X}^T)\leq n$, and $\mathbf{X}$ is a random matrix, meaning that its rows are random vectors sampled from an (unknown) probability distribution. For most nondegenerate probability distributions, $\mathbf{X}\mathbf{X}^T$ will therefore be full rank with probability 1.
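A quick numerical illustration of this claim, using Gaussian rows as a stand-in for the unknown distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 10
X = rng.standard_normal((n, d))  # rows drawn from a nondegenerate distribution

K = X @ X.T  # n x n
# K is full rank (= n) with probability 1, hence invertible
print(np.linalg.matrix_rank(K))
```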
EDIT2: a question was mentioned in the comments, but it isn't related to mine. I explicitly asked for a constructive proof of the fact that, when $n<d$, the minimum norm solution is
$$\beta^*=\mathbf{X}^T(\mathbf{X} \mathbf{X}^T)^{-1}\mathbf{y}$$
instead of the usual one, when $\mathbf{X}$ is full column rank
$$\beta^*=(\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Thus, answers that don't constructively derive this expression are not answers to this question.
(Note to whoever was thinking about answering my question: I'm not going to accept my own answer anytime soon, so if you were going to write an answer of yours, don't feel discouraged!)
Proof #1
By Lagrange multipliers: we're looking for
$$\min_{\beta}{\frac{1}{2}||\beta||^2} \quad \text{s.t.}\quad \mathbf{X}\beta=\mathbf{y}$$
Convert to unconstrained optimization by introducing a Lagrange multiplier $\lambda$:
$$\min_{\beta,\lambda}\frac{1}{2}||\beta||^2+\lambda^T(\mathbf{X}\beta-\mathbf{y})=\min_{\beta,\lambda}\mathcal{L}(\beta,\lambda)$$
Set gradients to 0:
$$\begin{aligned} \nabla_{\beta}{\mathcal{L}} & = \beta +\mathbf{X}^T\lambda=0 \\ \nabla_{\lambda}{\mathcal{L}} & = \mathbf{X}\beta-\mathbf{y}=0 \end{aligned}$$
From the first equation, $\beta=-\mathbf{X}^T\lambda$; substituting into the second gives $-\mathbf{X}\mathbf{X}^T\lambda=\mathbf{y}$. But since $\mathbf{X}\mathbf{X}^T$ is invertible, we get $$\lambda=-(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}\implies\beta=-\mathbf{X}^T\lambda=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}\ \square.$$
The nice thing about this proof is that you don't have to commit the expression of the pseudo-inverse to memory.
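As a sanity check (not part of the proof), a small NumPy experiment can confirm that this formula interpolates the data and agrees with the minimum-norm solutions returned by `np.linalg.pinv` and `np.linalg.lstsq`, again with synthetic Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 10  # n < d: overparametrized
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# beta* = X^T (X X^T)^{-1} y, computed via a linear solve for stability
beta_star = X.T @ np.linalg.solve(X @ X.T, y)

print(np.allclose(X @ beta_star, y))                      # interpolates: True
print(np.allclose(beta_star, np.linalg.pinv(X) @ y))      # matches pinv: True
print(np.allclose(beta_star,
                  np.linalg.lstsq(X, y, rcond=None)[0]))  # matches lstsq: True
```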
Proof #2
Not as satisfying. Assume $\beta^*=\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$: we want to prove that 1) $\beta^*$ solves $\mathbf{X}\beta=\mathbf{y}$, and 2) $\beta^*$ has minimum norm among all such solutions.
1) is trivial: $\mathbf{X}\beta^*=\mathbf{X}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}=\mathbf{y}$. To prove 2), it's sufficient to prove that, given any other solution $\beta$ to $\mathbf{X}\beta=\mathbf{y}$, the difference $\mathbf{u}=\beta-\beta^*$ is orthogonal to $\beta^*$. The assertion then follows immediately: $$||\beta||^2=\beta^T\beta=(\mathbf{u}+\beta^*)^T(\mathbf{u}+\beta^*)=||\mathbf{u}||^2+2\mathbf{u}^T\beta^*+||\beta^*||^2=||\mathbf{u}||^2+||\beta^*||^2\geq||\beta^*||^2$$
To prove that $(\beta^*)^T\mathbf{u}=0$, we write
$$(\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y})^T\mathbf{u}=\mathbf{y}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{u}=\mathbf{y}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}(\beta-\beta^*)=\mathbf{y}^T(\mathbf{X}\mathbf{X}^T)^{-1}(\mathbf{y}-\mathbf{y})=0$$
where we used the fact that $\mathbf{X}\mathbf{X}^T$ is symmetric and so is its inverse.
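The orthogonality argument can also be checked numerically (a hypothetical example with Gaussian data): any other solution is $\beta^*$ plus a component in the null space of $\mathbf{X}$, and that component is orthogonal to $\beta^*$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta_star = X.T @ np.linalg.solve(X @ X.T, y)

# Build another solution: beta = beta* + u, with u the projection
# of a random vector z onto the null space of X
z = rng.standard_normal(d)
u = z - X.T @ np.linalg.solve(X @ X.T, X @ z)
beta = beta_star + u

print(np.allclose(X @ beta, y))   # still a solution: True
print(abs(u @ beta_star) < 1e-8)  # u orthogonal to beta*: True
print(np.linalg.norm(beta) >= np.linalg.norm(beta_star))  # True
```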