I'm reading The Elements of Statistical Learning. In the process I found myself lost in the notation, so I'm trying to set up a notation that's as explicit and pedantic as possible:
$X$, $Y$: continuous random variables
$p_{X,Y}(x,y):$ joint probability distribution of $X$ and $Y$
$p_X(x):$ marginal distribution of $X$
$p_{Y\mid X}(y\mid x):$ conditional distribution of $Y$ given $X$
This gives us the following form of Bayes' theorem:
$$\underbrace{p_{X\mid Y}(x\mid y) \cdot p_Y(y)}_{p_{X,Y}(x,y)} = \underbrace{p_{Y\mid X}(y\mid x) \cdot p_X(x)}_{p_{X,Y}(x,y)}$$
Using this, marginalization looks like this:
$$p_X(x) = \int_Y p_{X,Y}(x,y)\cdot dy = \int_Y p_{X\mid Y}(x\mid y) \cdot p_Y(y) \cdot dy$$
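Both identities can be sanity-checked on a small discrete joint distribution, where sums stand in for the integrals (the numbers below are made up for illustration):

```python
import numpy as np

# Made-up 2x3 joint distribution p_{X,Y}(x, y);
# rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)              # p_X(x): marginalize over y
p_y = p_xy.sum(axis=0)              # p_Y(y): marginalize over x
p_x_given_y = p_xy / p_y            # p_{X|Y}(x|y): columns sum to 1
p_y_given_x = p_xy / p_x[:, None]   # p_{Y|X}(y|x): rows sum to 1

# Both factorizations recover the joint (Bayes' theorem):
assert np.allclose(p_x_given_y * p_y, p_xy)
assert np.allclose(p_y_given_x * p_x[:, None], p_xy)

# Marginalization via the conditional: p_X(x) = sum_y p_{X|Y}(x|y) p_Y(y)
assert np.allclose((p_x_given_y * p_y).sum(axis=1), p_x)
```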
Given a function $h(x,y)$, we can define expected values as:
$$E_{X}[h(X,Y)] := \int_X h(x,y) \cdot p_X(x) \cdot dx$$
$$E_{Y\mid X}[h(X,Y)] := \int_Y h(x,y) \cdot p_{Y\mid X}(y\mid x) \cdot dy$$
$$E_{X,Y}[h(X,Y)] := \int_X \left(\int_Y h(x,y) \cdot p_{Y\mid X}(y\mid x) \cdot dy\right) \cdot p_X(x) \cdot dx$$
This allows us to write the law of total expectation as follows:
$$E_{X,Y}[h(X,Y)] = E_X[E_{Y\mid X}[h(X,Y)]]$$
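The law of total expectation can be checked numerically on the same kind of made-up discrete joint, computing $E_{X,Y}$ once directly against the joint and once as the iterated expectation:

```python
import numpy as np

# Made-up discrete setup; sums stand in for the integrals.
xs = np.array([0.0, 1.0])
ys = np.array([-1.0, 0.0, 2.0])
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

h = xs[:, None] + ys[None, :] ** 2   # an arbitrary h(x, y) = x + y^2

# E_{X,Y}[h(X,Y)] computed directly against the joint:
e_joint = (h * p_xy).sum()

# E_X[E_{Y|X}[h(X,Y)]]: inner sum over y against p_{Y|X}(y|x),
# outer sum over x against p_X(x).
e_inner = (h * p_y_given_x).sum(axis=1)   # a function of x
e_iterated = (e_inner * p_x).sum()

assert np.isclose(e_joint, e_iterated)
```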
Writing the expected value of a sum looks as follows:
$$E_{X,Y}[X + Y] = E_X[X] + E_Y[Y]$$
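Linearity can be verified on the same made-up joint; note that it holds even though $X$ and $Y$ in this example are dependent:

```python
import numpy as np

xs = np.array([0.0, 1.0])
ys = np.array([-1.0, 0.0, 2.0])
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

e_sum = ((xs[:, None] + ys[None, :]) * p_xy).sum()  # E_{X,Y}[X + Y]
e_x = (xs * p_x).sum()                              # E_X[X]
e_y = (ys * p_y).sum()                              # E_Y[Y]

assert np.isclose(e_sum, e_x + e_y)
```

No independence assumption is needed for this identity; only the marginals enter on the right-hand side.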
So the subscript of $E$ tells us over which variable, and against which distribution, to integrate.
So far so good: the notation is quite cumbersome, but it's consistent. I run into a problem when trying to do the following:
Assume
- $Y = f(X) + \epsilon$
We then want to compute the expected squared prediction error $E_{X,Y}[(\hat f(X)-Y)^2]$ of a predictor $\hat f$. Along the way (see Wikipedia) we get a term $E_{X,Y}[(Y-f(X))^2]$, which according to the assumption above is $E_{X,Y}[\epsilon^2]$.
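Numerically the step checks out; here is a Monte Carlo sketch under an assumed $f$ and noise distribution (none of these choices come from the book, they are just for illustration):

```python
import numpy as np

# Assumed setup: f(x) = sin(x), X ~ Uniform(-3, 3),
# eps Gaussian with mean 0 and standard deviation 0.5, independent of X.
rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-3.0, 3.0, n)
eps = rng.normal(0.0, 0.5, n)
y = np.sin(x) + eps

# Sample average over (x, y) pairs, i.e. E_{X,Y}[(Y - f(X))^2] ...
lhs = np.mean((y - np.sin(x)) ** 2)
# ... and the sample average of eps^2, i.e. E[eps^2]:
rhs = np.mean(eps ** 2)

# Pointwise, y - sin(x) is exactly eps by construction, so the two
# averages agree up to floating-point rounding.
assert abs(lhs - rhs) < 1e-9
```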
Now how do I know this is equal to $E_\epsilon[\epsilon^2]$?