I'm reading section 2.5 of The Elements of Statistical Learning by Hastie et al. (second edition), and there's an equation I don't quite understand (here we have $N$ training samples, the inputs $X$ are vectors in some dimension, and the outputs $Y$ are scalars).
The authors write on page 24:
Suppose that we know that the relationship between $Y$ and $X$ is linear, $$ Y = X^T \beta + \epsilon $$
where $\epsilon \sim N(0, \sigma^2)$ and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat{y}_0 = x_{0}^T \hat{\beta}$, which can be written as $\hat{y}_0 = x_{0}^T \beta + \sum_{i=1}^{N} \ell_i (x_0) \epsilon_i$ where $\ell_i(x_0)$ is the $i$'th element of $X(X^TX)^{-1} x_0$.
From what I understand, the least-squares solution in this noisy setting would be
$$ \hat{\beta} = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X \beta + \epsilon'), $$ where $\epsilon'$ is the vector of noise values, one per training sample.
I'm not seeing how this leads to the equation described in the last line of the quote. Any insights appreciated.
If you look carefully at the text, the $X$ in $Y=X^\top \beta + \epsilon$ is not bold, while the $X$ in $\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} x_0$ is bold.
This is because $Y$ is a real number while $X$ and $\beta$ are in $\mathbb{R}^p$, and $Y = X^\top \beta + \epsilon$ represents the model for a single data point. If you have $n$ data points, then this equation becomes
$$y_{n \times 1} = \mathbf{X}_{n \times p} \beta_{p \times 1} + \epsilon'_{n \times 1}$$ where I have added the dimensions of each term for clarity.
Thus, $$\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\mathbf{X}\beta + \epsilon') = \beta + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \epsilon'.$$
Plugging this into the prediction at the test point gives
$$\hat{y}_0 = x_0^\top \hat{\beta} = x_0^\top \beta + x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \epsilon' = x_0^\top \beta + \sum_{i=1}^{n} \ell_i(x_0)\, \epsilon_i,$$
since $x_0^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is a $1 \times n$ row vector whose transpose is $\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} x_0$, so its $i$th entry is exactly the $\ell_i(x_0)$ from the book.
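If it helps, here is a quick numerical sanity check of the identity (my own sketch, not from the book; the dimensions and seed are arbitrary choices):

```python
import numpy as np

# Verify numerically that x0' beta_hat = x0' beta + sum_i l_i(x0) * eps_i,
# where l(x0) = X (X'X)^{-1} x0 is the n-vector from the book's equation.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))          # training inputs, one row per sample
beta = np.array([1.0, -2.0, 0.5])    # "true" coefficients (arbitrary)
eps = rng.normal(scale=0.3, size=n)  # noise vector epsilon'
y = X @ beta + eps

# Least-squares fit: beta_hat = (X'X)^{-1} X' y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

x0 = rng.normal(size=p)                   # arbitrary test point
l = X @ np.linalg.solve(X.T @ X, x0)      # l_i(x0) for i = 1..n

lhs = x0 @ beta_hat                       # prediction x0' beta_hat
rhs = x0 @ beta + l @ eps                 # decomposition from the book
print(np.allclose(lhs, rhs))
```

The two expressions agree to floating-point precision, which is just the row-vector/transpose identity above in numerical form.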