Consider the generative model for linear regression with true parameter $w^* \in S^{d-1}$, $$y=Xw^*+e$$ with i.i.d. Gaussian noise $e \sim N(0, \sigma^2I_n)$. Assume $X \in \mathbb{R}^{n\times d}$ has full column rank. By the normal equations, the least-squares estimate is $$\hat{w}=(X^\intercal X)^{-1}X^\intercal y=(X^\intercal X)^{-1}X^\intercal (Xw^*+e)=w^*+(X^\intercal X)^{-1}X^\intercal e.$$
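Consider the thin SVD $X=U\Sigma V^\intercal$, where $U \in \mathbb{R}^{n \times d}$ has orthonormal columns, $V \in \mathbb{R}^{d \times d}$ is orthogonal, and $\Sigma = \operatorname{diag}(\sigma_1,\sigma_2, \cdots, \sigma_d)\in \mathbb{R}^{d \times d}$ holds the singular values of $X$ (not to be confused with the noise level $\sigma$).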
Since $$ \|\hat{w} - w^*\|_2=\|(X^\intercal X)^{-1}X^\intercal e\|_2=\|V\Sigma ^{-1} U^\intercal e\|_2=\|\Sigma ^{-1} U^\intercal e\|_2,$$ and $\mathbb{E}_{e}[ee^\intercal]=\sigma^2 I_n$, we have $$\begin{aligned} \mathbb{E}_{e}\|\hat{w} -w^*\|_2^2 &=\mathbb{E}_{e} \|\Sigma^{-1} U^\intercal e\|_2^2 = \mathbb{E}_{e} \big[\operatorname{Tr}[e^\intercal U\Sigma^{-2} U^\intercal e] \big]= \mathbb{E}_{e} \big[\operatorname{Tr}[ \Sigma^{-2} U^\intercal ee^\intercal U] \big]\\ &= \operatorname{Tr}\big[ \Sigma^{-2} U^\intercal \mathbb{E}_{e} [ee^\intercal] U\big] =\sigma^2\operatorname{Tr}[ \Sigma^{-2} U^\intercal U] =\sigma^2\operatorname{Tr}[ \Sigma^{-2}]= \sigma^2\sum_{i=1}^{d}\frac{1}{\sigma_i^2}=\sigma^2\|X^\dagger\|_F^2. \end{aligned}$$
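As a sanity check on the fixed-design identity, here is a minimal Monte Carlo sketch (not a proof). It compares the closed form $\sigma^2\sum_i 1/\sigma_i^2$, where the factor $\sigma^2$ comes from $\mathbb{E}_e[ee^\intercal]=\sigma^2 I_n$, against an empirical average over noise draws for one fixed $X$. The sizes `n`, `d`, the value of `noise_std`, the number of trials, and the function names are my own illustrative choices, not part of the question.

```python
import numpy as np

def closed_form_sq_error(X, noise_std):
    """noise_std^2 * sum_i 1 / sigma_i(X)^2, i.e. noise_std^2 * ||pinv(X)||_F^2."""
    singular_values = np.linalg.svd(X, compute_uv=False)
    return noise_std**2 * np.sum(1.0 / singular_values**2)

def monte_carlo_sq_error(X, w_star, noise_std, trials, rng):
    """Average ||w_hat - w_star||_2^2 over `trials` independent noise draws, X fixed."""
    n = X.shape[0]
    pinv = np.linalg.pinv(X)                          # equals (X^T X)^{-1} X^T for full column rank
    noise = noise_std * rng.standard_normal((n, trials))
    Y = X @ w_star[:, None] + noise                   # one column of responses per trial
    W_hat = pinv @ Y                                  # least-squares estimate per trial
    sq_errors = np.sum((W_hat - w_star[:, None])**2, axis=0)
    return sq_errors.mean()

rng = np.random.default_rng(0)
n, d, noise_std = 200, 10, 0.5                        # illustrative values (assumption)
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)                      # put w_star on the unit sphere

closed = closed_form_sq_error(X, noise_std)
empirical = monte_carlo_sq_error(X, w_star, noise_std, trials=5000, rng=rng)
print(f"closed form: {closed:.5f}, Monte Carlo: {empirical:.5f}")
```

The two numbers should agree up to Monte Carlo error; setting `noise_std = 1` recovers the plain $\sum_i 1/\sigma_i^2$ term.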
Assume the entries of $X$ are sampled i.i.d. from $N(0,1)$. Results from "High Dimensional Probability" by Roman Vershynin give $\sigma_1(X) \approx \sqrt{n}+\sqrt{d}$ and $\sigma_d(X)\approx \sqrt{n}-\sqrt{d}$ with high probability, so that $\|X^\dagger\|_F \leq \sqrt{d}\,\sigma_{\max}(X^\dagger)=\frac{ \sqrt{d}}{\sigma_d(X)}\approx\frac{\sqrt{d}}{\sqrt{n}-\sqrt{d}}$. However, I want to prove a stronger result, $\mathbb{E}_{X,e}\|\hat{w} - w^*\|_2 \leq O \left(\sigma \sqrt{\frac{d}{n}}\right)$. Please help.
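To see numerically how the target rate compares with the singular-value bound, here is a rough simulation sketch that averages over both $X$ and $e$. It is only an empirical illustration under my own choice of `n`, `d`, `noise_std`, and trial count; `crude_bound` plugs the heuristic $\sigma_d(X)\approx\sqrt{n}-\sqrt{d}$ into the bound above and scales it by the noise level, so it is not a rigorous constant.

```python
import numpy as np

def mean_estimation_error(n, d, noise_std, trials, rng):
    """Monte Carlo estimate of E_{X,e} ||w_hat - w_star||_2 for Gaussian X and e."""
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)                  # true parameter on the unit sphere
    errors = np.empty(trials)
    for t in range(trials):
        X = rng.standard_normal((n, d))               # fresh design with i.i.d. N(0, 1) entries
        e = noise_std * rng.standard_normal(n)        # fresh noise vector
        y = X @ w_star + e
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares solution
        errors[t] = np.linalg.norm(w_hat - w_star)
    return errors.mean()

rng = np.random.default_rng(1)
n, d, noise_std = 400, 10, 0.5                        # illustrative values (assumption)
avg_err = mean_estimation_error(n, d, noise_std, trials=2000, rng=rng)

target_rate = noise_std * np.sqrt(d / n)                          # sigma * sqrt(d / n)
crude_bound = noise_std * np.sqrt(d) / (np.sqrt(n) - np.sqrt(d))  # heuristic, see lead-in

print(f"average error: {avg_err:.4f}")
print(f"sigma*sqrt(d/n): {target_rate:.4f}")
print(f"crude bound: {crude_bound:.4f}")
```

Varying `n` and `d` while keeping `d` well below `n` gives a quick feel for whether the average error tracks $\sigma\sqrt{d/n}$ in the regime of interest.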