This is something I definitely know how to do, but it is slipping my mind at the moment. I'm reading through Rice's Mathematical Statistics, and the following is the proof establishing an estimator for the error variance in a linear regression (you can ignore most of the proof; it is only the last line I'm stuck on):
I understand the proof of $E(||\textbf{Y} - \hat{\textbf{Y}} ||^{2}) = (n-p) \sigma^{2}$. What I'm having trouble figuring out is how
$$s^{2} = \frac{||\textbf{Y} - \hat{\textbf{Y}} ||^{2}}{(n-p)}$$
came about; specifically, what was on the other side of the equation that allowed us to divide by $(n-p)$?
$$\text{?something?} = E(||\textbf{Y} - \hat{\textbf{Y}} ||^{2}) = (n-p) \sigma^{2}$$
I ask because if we expand $E(||\textbf{Y} - \hat{\textbf{Y}} ||^{2})$, then the only thing I see appearing on the other side would be:
$$\sum_{i = 1}^{n}E\left[(Y_i - \hat{Y}_i)^2\right]$$
which to me doesn't amount to anything special, since what I want is $||\textbf{Y} - \hat{\textbf{Y}} ||^{2}$ without the expectation operator.


We are looking for an unbiased estimator $s^2$ of $\sigma^2$, which means:
$$ E[s^2] = \sigma^2 $$
If we define $s^{2} = \frac{||\textbf{Y} - \hat{\textbf{Y}} ||^{2}}{(n-p)}$, then by Theorem A we get:
$$ \begin{aligned} E[s^{2}] &= \frac{E[||\textbf{Y} - \hat{\textbf{Y}} ||^{2}]}{(n-p)} \\ &= \frac{(n-p)}{(n-p)} \sigma^{2} = \sigma^2 \end{aligned} $$
This proves that, in fact, $s^2$ is unbiased.
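A quick Monte Carlo sketch can confirm this numerically: simulate many data sets from a fixed linear model, compute $s^2 = ||\textbf{Y} - \hat{\textbf{Y}}||^2 / (n-p)$ for each, and check that the average is close to the true $\sigma^2$. The design matrix, coefficients, and variance below are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 3          # sample size and number of columns of the design matrix
sigma2 = 4.0          # true error variance (chosen for this illustration)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])   # arbitrary true coefficients

reps = 20000
s2_values = np.empty(reps)
for r in range(reps):
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    Y = X @ beta + eps
    # OLS fit: Yhat = X (X'X)^{-1} X' Y, computed via least squares
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rss = np.sum((Y - X @ beta_hat) ** 2)   # ||Y - Yhat||^2
    s2_values[r] = rss / (n - p)            # s^2 = RSS / (n - p)

print(np.mean(s2_values))  # close to sigma2 = 4.0
```

Averaging over the replications approximates $E[s^2]$, and the result lands near the true $\sigma^2$, as the proof predicts.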
In general terms, we have a model that depends on the value of some parameter $\theta$ (in this case $\sigma^2$; once we have $\sigma^2$, we can fit our model to the data). An unbiased estimator $\hat{\theta}$ (in this case $s^2$) is a value extracted from the data that represents the information $\theta$ is supposed to give us. It represents that parameter in such a way that, on average (i.e., in expected value), it equals the parameter:
$$ \text{Bias}(\theta, \hat{\theta}) = 0 \iff E[\hat{\theta}] = \theta \quad \left( \text{i.e. } E[s^2] = \sigma^2 \text{ here} \right) $$
There is no general strategy for finding $\hat{\theta}$, so we explore the model and look for some pattern that lets us propose such a $\hat{\theta}$. In our case, that pattern is given by Theorem A, because we have an expected value on one side and our parameter, namely $\sigma^2$, explicitly on the other side. Given that structure, we propose the functional form for $s^2$ given in the book.