Derive $ \sum_{i=1}^N \operatorname{Cov}\left(\hat{y}_i, y_i\right)=d \sigma_{\varepsilon}^2 $.


In Elements of Statistical Learning, Chapter 7, the authors derive the important relation $$ \mathrm{E}_{\mathbf{y}} \left(\operatorname{Err}_{\mathrm{in}}\right) =\mathrm{E}_{\mathbf{y}} (\overline{\mathrm{err}})+\frac{2}{N} \sum_{i=1}^N \operatorname{Cov}\left(\widehat{y}_i, y_i\right). $$ This expression simplifies if $\widehat{y}_i$ is obtained by a linear fit with $d$ inputs or basis functions. For example, $$ \sum_{i=1}^N \operatorname{Cov}\left(\hat{y}_i, y_i\right)=d \sigma_{\varepsilon}^2 $$ for the additive error model $Y=f(X)+\varepsilon$.

My question is how to derive $$ \sum_{i=1}^N \operatorname{Cov}\left(\widehat{y}_i, y_i\right)=d \sigma_{\varepsilon}^2. $$ Here is what I have tried: $$ \begin{aligned} \sum_{i=1}^N \operatorname{cov}\left(\widehat{y}_i, y_i\right) &=\sum_{i=1}^N \operatorname{cov}\left(\widehat{y}_i, \widehat{y}_i+\left(y_i-\widehat{y}_i\right)\right) \\ & =\sum_{i=1}^N \operatorname{cov}\left(\widehat{y}_i, \widehat{y}_i\right)+\operatorname{cov}\left(\widehat{y}_i, y_i-\widehat{y}_i\right) \\ & =\sum_{i=1}^N \operatorname{var}\left(\widehat{f}\left(x_i\right)\right)+\operatorname{cov}\left(\widehat{f}\left(x_i\right), \varepsilon_i\right) \\ & =\sum_{i=1}^N \operatorname{var}\left(x_i^{\top} \widehat{\beta}\right)+\operatorname{cov}\left(x_i^{\top} \widehat{\beta}, \varepsilon_i\right) \end{aligned} $$

Based on equation (7.12) on page 224, i.e. $\sum_{i=1}^N \operatorname{var}\left(x_i^{\top} \widehat{\beta}\right) = d \sigma^2_{\varepsilon}$ (after some effort I still don't know how to derive that, but I suppose that is a separate question), $\operatorname{cov}\left(x_i^{\top} \widehat{\beta}, \varepsilon_i\right)$ must be $0$.
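For completeness, the identity (7.12) can be sketched via the hat matrix $H = X\left(X^\top X\right)^{-1} X^\top$, treating $X$ as fixed (this is my own reconstruction, not the book's wording): since $\operatorname{Var}(\widehat\beta)=\sigma_\varepsilon^2\left(X^\top X\right)^{-1}$,
$$
\operatorname{var}\left(x_i^{\top} \widehat{\beta}\right) = x_i^{\top} \operatorname{Var}(\widehat\beta)\, x_i = \sigma_{\varepsilon}^2\, x_i^{\top}\left(X^{\top} X\right)^{-1} x_i = \sigma_{\varepsilon}^2 H_{ii},
$$
so
$$
\sum_{i=1}^N \operatorname{var}\left(x_i^{\top} \widehat{\beta}\right) = \sigma_{\varepsilon}^2 \operatorname{trace}(H) = \sigma_{\varepsilon}^2 \operatorname{trace}\left(\left(X^{\top} X\right)^{-1} X^{\top} X\right) = d\,\sigma_{\varepsilon}^2.
$$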

But since $\widehat\beta = (X^TX)^{-1}X^TY$, clearly $\operatorname{cov}\left(x_i^{\top} \widehat{\beta}, \varepsilon_i\right)$ is not $0$.

Later reflection: this method actually works. My mistake was treating $y_i - \hat{y}_i$ as $\varepsilon_i$. In fact, $y_i - \hat{y}_i$ is the residual, and because the fitted values are uncorrelated with the residuals, the second term vanishes.
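Concretely, writing $\widehat{\mathbf y}=H\mathbf y$ with the hat matrix $H=X\left(X^\top X\right)^{-1}X^\top$, a short sketch of why the second term vanishes:
$$
\operatorname{Cov}\left(H\mathbf y,\,(I-H)\mathbf y\right) = H \operatorname{Cov}(\mathbf y)\,(I-H)^{\top} = \sigma_\varepsilon^2\, H(I-H) = \sigma_\varepsilon^2\left(H - H^2\right) = 0,
$$
since $H$ is symmetric and idempotent; in particular $\operatorname{cov}\left(\widehat y_i,\, y_i-\widehat y_i\right)=0$ for every $i$.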

Best answer:

Assume the model is $Y=X\beta+\epsilon$ with $\epsilon\sim N(0,\sigma_{\epsilon}^2I_N)$, where $X$ is the $N\times d$ design matrix and $\beta$ is a $d$-dimensional coefficient vector. In other words, $y_i=x_i^T\beta + \epsilon_i$ and $\epsilon_i\sim N(0, \sigma_\epsilon^2)$, where $x_i^T$ is the $i$-th row of $X$. Treat $X$ as fixed, note that $x_i^T\beta$ is non-random, and write $e_i$ for the $i$-th standard basis vector of $\mathbb{R}^N$. Then $$ \begin{align} \sum_{i=1}^N \operatorname{cov}(\widehat{y}_i,y_i) &= \sum_{i=1}^N \operatorname{cov}\left(x_i^T\widehat{\beta},x_i^T\beta + \epsilon_i\right)\\ &= \sum_{i=1}^N \operatorname{cov}\left(x_i^T \left(X^TX\right)^{-1}X^T\left(X\beta +\epsilon\right),x_i^T\beta + \epsilon_i\right)\\ &=\sum_{i=1}^N x_i^T \left(X^TX\right)^{-1}X^T \operatorname{cov}(\epsilon, \epsilon_i)\\ &=\sum_{i=1}^N x_i^T \left(X^TX\right)^{-1}X^T \sigma_\epsilon^2e_i\\ &=\sigma_\epsilon^2\sum_{i=1}^N x_i^T \left(X^TX\right)^{-1}x_i\\ &=\sigma_\epsilon^2\sum_{i=1}^N \operatorname{trace}\left(x_i^T \left(X^TX\right)^{-1}x_i\right)\\ &=\sigma_\epsilon^2\sum_{i=1}^N \operatorname{trace}\left(x_ix_i^T \left(X^TX\right)^{-1}\right)\\ &=\sigma_\epsilon^2 \operatorname{trace}\left(\sum_{i=1}^N x_ix_i^T \left(X^TX\right)^{-1}\right)\\ &=\sigma_\epsilon^2 \operatorname{trace}\left( \left(\sum_{i=1}^N x_ix_i^T \right)\left(X^TX\right)^{-1} \right)\\ &=\sigma_\epsilon^2 \operatorname{trace}\left( \left(X^TX\right)\left(X^TX\right)^{-1} \right)\\ &=\sigma_\epsilon^2\operatorname{trace}(I_d)\\ &=d\sigma_\epsilon^2, \end{align} $$

which is what you wanted.
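As a sanity check (my own addition, not part of the answer above), a short Monte Carlo simulation with NumPy estimates $\sum_i \operatorname{cov}(\widehat y_i, y_i)$ over repeated draws of $\mathbf y$ and compares it to $d\sigma_\epsilon^2$; the sizes and seed below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 2.0                 # illustrative sizes and noise level
X = rng.normal(size=(N, d))              # fixed design matrix
beta = rng.normal(size=d)
f = X @ beta                             # non-random mean vector f(x_i)

# Hat matrix H = X (X^T X)^{-1} X^T; note trace(H) = d
H = X @ np.linalg.solve(X.T @ X, X.T)

reps = 20000
ys = f + sigma * rng.normal(size=(reps, N))   # reps independent draws of y
yhats = ys @ H.T                              # fitted values, yhat = H y

# Estimate sum_i Cov(yhat_i, y_i) across the replications
cov_sum = np.mean((yhats - yhats.mean(axis=0)) * (ys - ys.mean(axis=0)),
                  axis=0).sum()

print(f"Monte Carlo estimate: {cov_sum:.3f}, theory d*sigma^2 = {d * sigma**2:.3f}")
```

With $d=5$ and $\sigma_\epsilon=2$ the theoretical value is $20$, and the simulated sum lands close to it.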