In my textbook, on the topic of linear regression / machine learning, the following problem is stated, quoted directly:
Consider a noisy target, $ y = (w^{*})^T \textbf{x} + \epsilon $, for generating the data, where $\epsilon$ is a noise term with zero mean and $\sigma^2$ variance, independently generated for every example $(\textbf{x},y)$. The expected error of the best possible linear fit to this target is thus $\sigma^2$.
For the data $D = \{ (\textbf{x}_1,y_1), ..., (\textbf{x}_N,y_N) \}$, denote the noise in $y_n$ as $\epsilon_n$, and let $ \mathbf{\epsilon} = [\epsilon_1, \epsilon_2, ...\epsilon_N]^T$; assume that $X^TX$ is invertible. By following the steps below, show that the expected in-sample error of linear regression with respect to $D$ is given by,
$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})$
Below is my approach.

The book says that the in-sample error vector, $\hat{\textbf{y}} - \textbf{y}$, can be expressed as $(H-I)\epsilon$, where $H = X(X^TX)^{-1}X^T$ is the hat matrix and $\epsilon$ is the noise vector.
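This identity is easy to check numerically. Below is a quick sketch of my own (the design matrix, weights, and noise scale are arbitrary choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(42)
N, d, sigma = 30, 4, 0.5

# Design matrix with an intercept column, so X is N x (d+1)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
w_star = rng.standard_normal(d + 1)          # true weights w*
eps = rng.normal(0.0, sigma, N)              # zero-mean noise
y = X @ w_star + eps                         # noisy target

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
y_hat = H @ y                                # linear-regression fit

# The signal X w* lies in the column space of X, so it is fit exactly
# and the residual depends only on the noise: y_hat - y = (H - I) eps
assert np.allclose(y_hat - y, (H - np.eye(N)) @ eps)
```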
So, I calculated in-sample error, $E_{in}( \textbf{w}_{lin} )$, as,
$E_{in}( \textbf{w}_{lin} ) = \frac{1}{N}(\hat{\textbf{y}} - \textbf{y})^T (\hat{\textbf{y}} - \textbf{y}) = \frac{1}{N} (\epsilon^T (H-I)^T (H-I) \epsilon)$
Since it is given by the book that,
$(I-H)^K = (I-H)$ for any integer $K \geq 1$, $(I-H)$ is symmetric, and $\mathrm{trace}(H) = d+1$,
I got the following simplified expression,
$E_{in}( \textbf{w}_{lin} ) =\frac{1}{N} (\epsilon^T (H-I)^T (H-I) \epsilon) = \frac{1}{N} \epsilon^T (I-H) \epsilon = \frac{1}{N} \epsilon^T \epsilon - \frac{1}{N} \epsilon^T H \epsilon$
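Those facts about $H$ can also be verified numerically (again a sketch; any full-column-rank $X$ will do):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3

# Random design matrix with intercept column: X is N x (d+1)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(N)

assert np.allclose(I - H, (I - H).T)          # I - H is symmetric
assert np.allclose((I - H) @ (I - H), I - H)  # (I - H)^K = I - H (idempotent)
assert np.isclose(np.trace(H), d + 1)         # trace(H) = d + 1
```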
Here, I see that,
$\mathbb{E}_D[\frac{1}{N} \epsilon^T \epsilon] = \frac {N \sigma^2}{N} = \sigma^2$
Also, expanding $ - \frac{1}{N} \epsilon^T H \epsilon$ gives the sum,

$ - \frac{1}{N} \epsilon^T H \epsilon = - \frac{1}{N} \left( \sum_{i=1}^{N} H_{ii} \epsilon_i^2 + \sum_{\substack{i,j \in \{1..N\} \\ i \neq j}} H_{ij} \ \epsilon_i \ \epsilon_j \right)$
I understand that,

$ - \frac{1}{N} \mathbb{E}_D\left[\sum_{i=1}^{N} H_{ii} \epsilon_i^2\right] = - \frac{\mathrm{trace}(H)}{N} \ \sigma^2 = - \frac{d+1}{N} \ \sigma^2$
However, I don't understand why,
$ - \frac{1}{N} \mathbb{E}_D\left[\sum_{\substack{i,j \in \{1..N\} \\ i \neq j}} H_{ij} \ \epsilon_i \ \epsilon_j \right] = 0 \qquad (eq \ 1)$
$(eq 1)$ should be equal to $0$ in order to satisfy the equation,
$ \mathbb{E}_D[E_{in}( \textbf{w}_{lin} )] = \sigma^2 (1 - \frac{d+1}{N})$
Could anyone explain why $(eq \ 1)$ evaluates to zero?
The explanation is statistical independence: the problem states that the noise is generated independently for every example $(\textbf{x}, y)$, with zero mean. For $i \neq j$, independence therefore gives $\mathbb{E}[\epsilon_i \epsilon_j] = \mathbb{E}[\epsilon_i] \, \mathbb{E}[\epsilon_j] = 0 \cdot 0 = 0$.

So every cross term in $(eq \ 1)$ has zero expectation, and their sum does too.

That does not happen for $\mathbb{E}[\epsilon_i^2]$: since $\epsilon_i^2$ is always nonnegative, it can never cancel itself out. For any zero-mean noise with variance $\sigma^2$ (Gaussian or not), $\mathbb{E}[\epsilon_i^2] = \mathrm{Var}(\epsilon_i) = \sigma^2$.
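To make both points concrete, here is a small Monte Carlo sketch of my own (the values of $N$, $d$, $\sigma$, and the number of trials are arbitrary): the cross terms average to zero, and the in-sample error averages to $\sigma^2 (1 - \frac{d+1}{N})$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma, trials = 20, 2, 1.0, 20000

# Fixed design matrix with intercept; w* is an arbitrary true weight vector
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
w_star = rng.standard_normal(d + 1)
H_off = H - np.diag(np.diag(H))              # off-diagonal part of H

e_in = np.empty(trials)
cross = np.empty(trials)
for t in range(trials):
    eps = rng.normal(0.0, sigma, N)          # independent zero-mean noise
    y = X @ w_star + eps
    y_hat = H @ y
    e_in[t] = np.mean((y_hat - y) ** 2)      # in-sample squared error
    cross[t] = -eps @ H_off @ eps / N        # the i != j terms of (eq 1)

print(e_in.mean())   # close to sigma^2 * (1 - (d+1)/N) = 0.85
print(cross.mean())  # close to 0
```

Averaging over many noise draws stands in for the expectation $\mathbb{E}_D[\cdot]$; the design matrix is held fixed, matching the derivation above.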