Understanding a statement from Elements of Statistical Learning about noisy linear regression


I'm reading section $2.5$ of Elements of Statistical Learning by Hastie et al. (second edition), and there's an equation I don't quite understand. (Here we have $N$ training samples; the $X$'s are vectors in some dimension and the $Y$'s scalars.)

The authors write on page $24$:

Suppose that we know that the relationship between $Y$ and $X$ is linear, $$ Y = X^T \beta + \epsilon $$

where $\epsilon \sim N(0, \sigma^2)$ and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat{y}_0 = x_{0}^T \hat{\beta}$, which can be written as $\hat{y}_0 = x_{0}^T \beta + \sum_{i=1}^{N} l_i (x_0) \epsilon_i$, where $l_i(x_0)$ is the $i$th element of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} x_0$.

From what I understand, the linear least-squares solution in a noisy scenario like this would be

$$ \hat{\beta} = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X \beta + \epsilon'), $$ where $\epsilon'$ is the vector of noise values, one per training sample.

I'm not seeing how this leads to the equation described in the last line of the quote. Any insights appreciated.

Best answer:

If you look carefully at the text, the $X$ in $Y=X^\top \beta + \epsilon$ is not bold, while the $X$ in the $\mathbf{X}(\mathbf{X^\top X})^{-1} x_0$ is bold.

[Screenshot of the quoted passage from ESL, page 24, showing the bold vs. non-bold $X$.]

This is because $Y$ is a real number while $X$ and $\beta$ are in $\mathbb{R}^p$, and $Y = X^\top \beta + \epsilon$ represents the model for a single data point. If you have $n$ data points, then this equation becomes

$$y_{n \times 1} = \mathbf{X}_{n \times p} \beta_{p \times 1} + \epsilon'_{n \times 1}$$ where I have added the dimensions of each term for clarity.

Thus, $$\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\mathbf{X}\beta + \epsilon') = \beta + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \epsilon'.$$

Plugging this into the prediction and using the symmetry of $(\mathbf{X}^\top \mathbf{X})^{-1}$ gives $$\hat{y}_0 = x_0^\top \hat{\beta} = x_0^\top \beta + x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \epsilon' = x_0^\top \beta + \big(\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} x_0\big)^\top \epsilon' = x_0^\top \beta + \sum_{i=1}^{n} l_i(x_0)\, \epsilon'_i,$$ where $l_i(x_0)$ is the $i$th element of $\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} x_0$, which is exactly the equation in the book.
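You can also check this identity numerically. A minimal NumPy sketch (synthetic data; the dimensions, true $\beta$, and noise scale are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))          # design matrix (the bold X), n x p
beta = np.array([1.0, -2.0, 0.5])    # "true" coefficients, chosen arbitrarily
eps = rng.normal(scale=0.3, size=n)  # noise vector epsilon'
y = X @ beta + eps

# Least-squares fit: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

x0 = rng.normal(size=p)              # an arbitrary test point

# Prediction from the fitted model: y0_hat = x0^T beta_hat
y0_hat = x0 @ beta_hat

# Same prediction via x0^T beta + sum_i l_i(x0) * eps_i,
# where the vector l(x0) = X (X^T X)^{-1} x0
l = X @ np.linalg.solve(X.T @ X, x0)
y0_decomposed = x0 @ beta + l @ eps

# The two expressions agree up to floating-point error
assert np.isclose(y0_hat, y0_decomposed)
```

The `assert` passes because the two expressions are algebraically identical; any difference is floating-point rounding.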