I am trying to find a proof for the MSE of a linear regression:
\begin{gather} \frac{1}{n}\mathrm{E} \left[ \| \mathbf{X}\mathbf{\hat{w}} - \mathbf{X}\mathbf{w}^{*} \|^{2}_{2} \right] = \sigma^{2}\frac{d}{n} \end{gather}
The variables are defined as follows:
$\mathbf{X} \in \mathbb{R}^{n \times d}$: full column rank feature matrix with $n$ samples in $d$ dimensions
$\mathbf{z} \in \mathbb{R}^{n}$: Gaussian noise vector, $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$
$\mathbf{y} \in \mathbb{R}^{n}$: noisy measurement of a true signal $\mathbf{y}^{*}$ with additive Gaussian noise, $\mathbf{y} = \mathbf{y}^{*} + \mathbf{z}$
The estimated weights are obtained by applying the pseudo-inverse of $\mathbf{X}$ (which equals $(\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal$ here, since $\mathbf{X}$ has full column rank) to the above $\mathbf{y}$:
\begin{gather} \mathbf{\hat{w}} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal(\mathbf{y}^{*} + \mathbf{z}) \end{gather}
With the true optimal weights denoted by
$$\mathbf{w}^{*} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^\intercal\mathbf{y}^{*}$$
The expectation is taken over the noise-vector $\mathbf{z}$ with all other variables assumed to be determined/non-probabilistic.
So far I have tried all kinds of shenanigans with the SVD $\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^\intercal$ but could only come to the result of
$$\frac{1}{n}\mathrm{diag}(\sigma^2, \dots, \sigma^2)$$
My main problem is figuring out how $d$ gets into the equation.
First, note that $w^* = (X^\intercal X)^{-1}X^\intercal y^*$ is not valid in general, because $X^\intercal X$ may be singular. The general expression uses the pseudoinverse: $w^* = X^+ y^*$.
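As a quick numerical sanity check (a NumPy sketch with arbitrary matrix sizes), one can confirm that `np.linalg.pinv` coincides with the normal-equations formula exactly when $X$ has full column rank, while the formula breaks down when it does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5

# Full column rank: the pseudoinverse coincides with (X^T X)^{-1} X^T.
X = rng.standard_normal((n, d))
pinv_formula = np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(np.linalg.pinv(X), pinv_formula)

# Rank deficient: X^T X is singular, but the pseudoinverse still exists.
X_def = np.hstack([X[:, :4], X[:, :1]])  # duplicate a column
print(np.linalg.matrix_rank(X_def.T @ X_def))  # rank 4 < 5, so inv() would fail
print(np.linalg.pinv(X_def).shape)             # pinv is still well defined
```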
The solution then is to use the well-known trace trick. First observe that $\hat{w} - w^* = X^+(y^* + z) - X^+y^* = X^+z$, so $X\hat{w} - Xw^* = XX^+z$. Then:
$$\begin{aligned} \tfrac{1}{n}\mathrm{E}\big[\|X\hat{w} - Xw^*\|_2^2\big] &= \tfrac{1}{n}\mathrm{E}\big[\|XX^+z\|_2^2\big] \\&= \tfrac{1}{n}\mathrm{E}[z^\intercal(XX^+)^\intercal XX^+z] \\&= \tfrac{1}{n}\mathrm{E}[z^\intercal XX^+z] \\&= \tfrac{1}{n}\mathrm{E}[\operatorname{tr}(z^\intercal XX^+z)] \\&= \tfrac{1}{n}\mathrm{E}[\operatorname{tr}(XX^+zz^\intercal)] \\&= \tfrac{1}{n}\operatorname{tr}(\mathrm{E}[XX^+zz^\intercal]) \\&= \tfrac{1}{n}\operatorname{tr}(XX^+\cdot\mathrm{E}[zz^\intercal]) \\&= \tfrac{1}{n}\operatorname{tr}(XX^+\cdot\sigma^2 \mathbf{I}_n) \\&= \tfrac{1}{n}\sigma^2 \operatorname{tr}(XX^+) \\&= \tfrac{1}{n}\sigma^2 \operatorname{rank}(X) \end{aligned}$$
The third line uses that $XX^+$ is a symmetric idempotent (orthogonal projection) matrix, so $(XX^+)^\intercal XX^+ = XX^+$; the last line uses $\operatorname{tr}(XX^+) = \operatorname{rank}(X)$, since the trace of a projector equals the dimension of the subspace it projects onto.
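A quick numerical check (a NumPy sketch, sizes arbitrary) of the projection facts about $XX^+$ used here, including a rank-deficient $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 6

X = rng.standard_normal((n, d))
X[:, -1] = X[:, 0] + X[:, 1]   # make X rank deficient: rank(X) = 5
P = X @ np.linalg.pinv(X)      # orthogonal projector onto col(X)

assert np.allclose(P @ P, P)   # idempotent
assert np.allclose(P.T, P)     # symmetric
assert np.isclose(np.trace(P), np.linalg.matrix_rank(X))  # trace = rank
```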
In particular, if $n≥d$ and $X$ has full column rank, then
$$\tfrac{1}{n}\mathrm{E}\big[\|X\hat{w} - Xw^*\|_2^2\big] = \tfrac{d}{n}\sigma^2$$
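The identity can also be verified by Monte Carlo simulation (a sketch; all sizes and the seed are arbitrary). Averaging the in-sample squared error over many independent draws of $z$ should approach $\sigma^2 d / n$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, sigma = 50, 8, 0.5
trials = 20000

X = rng.standard_normal((n, d))      # full column rank (almost surely)
w_star = rng.standard_normal(d)
y_star = X @ w_star                  # noiseless signal
X_pinv = np.linalg.pinv(X)

mse = 0.0
for _ in range(trials):
    z = sigma * rng.standard_normal(n)
    w_hat = X_pinv @ (y_star + z)    # least-squares fit to the noisy y
    mse += np.sum((X @ w_hat - X @ w_star) ** 2) / n
mse /= trials

print(mse, sigma**2 * d / n)         # both should be close to 0.04
```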