In my textbook, on the topic of linear regression / machine learning, the following problem is stated, quoted directly:
Consider a noisy target, $ y = (w^{*})^T \textbf{x} + \epsilon $, for generating the data, where $\epsilon$ is a noise term with zero mean and $\sigma^2$ variance, independently generated for every example $(\textbf{x},y)$. The expected error of the best possible linear fit to this target is thus $\sigma^2$.
For the data $D = \{ (\textbf{x}_1,y_1), ..., (\textbf{x}_N,y_N) \}$, denote the noise in $y_n$ as $\epsilon_n$, and let $ \mathbf{\epsilon} = [\epsilon_1, \epsilon_2, ...\epsilon_N]^T$; assume that $X^TX$ is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to $D$ is given by,
$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})$
Below is my approach.

The book says that the in-sample error vector, $\hat{\textbf{y}} - \textbf{y}$, can be expressed as $(H-I)\epsilon$, where $H = X(X^TX)^{-1}X^T$ is the hat matrix and $\epsilon$ is the noise vector.
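This identity is easy to check numerically. Below is a quick sketch of my own (the design matrix, weights, and noise scale are arbitrary choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
N, d, sigma = 30, 4, 0.5

# Design matrix with an intercept column, so X is N x (d+1)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_star = rng.standard_normal(d + 1)          # true weights w*
eps = rng.normal(0.0, sigma, N)              # zero-mean noise
y = X @ w_star + eps                         # noisy target

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
y_hat = H @ y                                # linear-regression fit

# The signal X w* lies in the column space of X, so it is fit exactly
# and the residual depends only on the noise: y_hat - y = (H - I) eps
assert np.allclose(y_hat - y, (H - np.eye(N)) @ eps)
```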
So, I calculated in-sample error, $E_{in}( \textbf{w}_{lin} )$, as,
$E_{in}( \textbf{w}_{lin} ) = \frac{1}{N}(\hat{\textbf{y}} - \textbf{y})^T (\hat{\textbf{y}} - \textbf{y}) = \frac{1}{N} (\epsilon^T (H-I)^T (H-I) \epsilon)$
Since it is given by the book that,
$(I-H)^K = (I-H)$ for any integer $K \geq 1$, $(I-H)$ is symmetric, and $\mathrm{trace}(H) = d+1$,
I got the following simplified expression,
$E_{in}( \textbf{w}_{lin} ) =\frac{1}{N} (\epsilon^T (H-I)^T (H-I) \epsilon) = \frac{1}{N} \epsilon^T (I-H) \epsilon = \frac{1}{N} \epsilon^T \epsilon - \frac{1}{N} \epsilon^T H \epsilon$
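Those facts about $H$ can also be verified numerically (again a sketch; any full-column-rank $X$ will do):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3

# Random design matrix with intercept column: X is N x (d+1)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(N)

assert np.allclose(I - H, (I - H).T)          # I - H is symmetric
assert np.allclose((I - H) @ (I - H), I - H)  # (I - H)^K = I - H (idempotent)
assert np.isclose(np.trace(H), d + 1)         # trace(H) = d + 1
```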
Here, I see that,
$\mathbb{E}_D[\frac{1}{N} \epsilon^T \epsilon] = \frac {N \sigma^2}{N} = \sigma^2$
Also, expanding $ - \frac{1}{N} \epsilon^T H \epsilon$ gives the sum,

$ - \frac{1}{N} \epsilon^T H \epsilon = - \frac{1}{N} \left( \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{\substack{i,j \in \{1..N\} \\ i \neq j}} H_{ij} \ \epsilon_i \ \epsilon_j \right)$
I understand that,

$ - \frac{1}{N} \mathbb{E}_D\left[\sum_{i=1}^{N} H_{ii} \epsilon_i^2\right] = - \frac{\mathrm{trace}(H)}{N} \ \sigma^2 = - \frac{d+1}{N} \ \sigma^2$
However, I don't understand why,
$ - \frac{1}{N} \mathbb{E}_D\left[\sum_{\substack{i,j \in \{1..N\} \\ i \neq j}} H_{ij} \ \epsilon_i \ \epsilon_j \right] = 0 \qquad (eq \ 1)$
$(eq 1)$ should be equal to $0$ in order to satisfy the equation,
$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})$
Could anyone explain why $(eq \ 1)$ evaluates to zero?
The explanation is statistical independence: the problem states that the noise is generated independently for every example $(\textbf{x}, y)$, with zero mean. For $i \neq j$, independence therefore gives $\mathbb{E}[\epsilon_i \epsilon_j] = \mathbb{E}[\epsilon_i] \, \mathbb{E}[\epsilon_j] = 0 \cdot 0 = 0$.

So every cross term in $(eq \ 1)$ has zero expectation, and their sum does too.

That does not happen for $\mathbb{E}[\epsilon_i^2]$: since $\epsilon_i^2$ is always nonnegative, it can never cancel itself out. For any zero-mean noise with variance $\sigma^2$ (Gaussian or not), $\mathbb{E}[\epsilon_i^2] = \mathrm{Var}(\epsilon_i) = \sigma^2$.
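To make both points concrete, here is a small Monte Carlo sketch of my own (the values of $N$, $d$, $\sigma$, and the number of trials are arbitrary): the cross terms average to zero, and the in-sample error averages to $\sigma^2 (1 - \frac{d+1}{N})$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma, trials = 20, 2, 1.0, 20000

# Fixed design matrix with intercept; w* is an arbitrary true weight vector
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
w_star = rng.standard_normal(d + 1)
H_off = H - np.diag(np.diag(H))              # off-diagonal part of H

e_in = np.empty(trials)
cross = np.empty(trials)
for t in range(trials):
    eps = rng.normal(0.0, sigma, N)          # independent zero-mean noise
    y = X @ w_star + eps
    y_hat = H @ y
    e_in[t] = np.mean((y_hat - y) ** 2)      # in-sample squared error
    cross[t] = -eps @ H_off @ eps / N        # the i != j terms of (eq 1)

print(e_in.mean())   # close to sigma^2 * (1 - (d+1)/N) = 0.85
print(cross.mean())  # close to 0
```

Averaging over many noise draws stands in for the expectation $\mathbb{E}_D[\cdot]$; the design matrix is held fixed, matching the derivation above.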