What is the difference between empirical loss and expected loss? And how can one understand the intuition behind, and the use of, the latter, given the following notation used in the lecture notes of my current Machine Learning course:
- $\mathcal{X}$ - the sample space.
- $\mathcal{Y}$ - the label space.
- $X \in \mathcal{X}$ - unlabeled sample.
- $(X, Y) \in(\mathcal{X} \times \mathcal{Y})$ - labeled sample.
- $S=\left\{\left(X_{1}, Y_{1}\right), \ldots,\left(X_{n}, Y_{n}\right)\right\}$ - a training set. We assume that $\left(X_{i}, Y_{i}\right)$ pairs in $S$ are sampled i.i.d. according to an unknown, but fixed distribution $p(X, Y)$.
- $h: \mathcal{X} \rightarrow \mathcal{Y}$ - a hypothesis, which is a function from $\mathcal{X}$ to $\mathcal{Y}$.
- $\mathcal{H}$ - a hypothesis set.
- $\ell\left(Y^{\prime}, Y\right)$ - the loss function for predicting $Y^{\prime}$ instead of $Y$.
- $\hat{L}(h, S)=\frac{1}{n} \sum_{i=1}^{n} \ell\left(h\left(X_{i}\right), Y_{i}\right)$ - the empirical loss (a.k.a. error or risk) of $h$ on $S$. (In many textbooks $S$ is omitted from the notation and $\hat{L}(h)$ or $\hat{L}_{n}(h)$ is used to denote $\hat{L}(h, S)$.)
- $L(h)=\mathbb{E}[\ell(h(X), Y)]$ - the expected loss (a.k.a. error or risk) of $h$, where the expectation is taken with respect to $p(X, Y)$.
EDIT
I am not entirely sure I understand why the expected value of the loss, $\mathbb{E}[\ell(h(X), Y)]$, isn't equal to the empirical loss, $\frac{1}{n} \sum_{i=1}^{n} \ell\left(h\left(X_{i}\right), Y_{i}\right)$, and hence why $L(h) \neq \hat{L}(h, S)$?
If we expand it just a little, using your notation, the expected loss is: $$\mathbb{E}[\ell(h(X),Y)] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x),y) \, dP(x,y)$$
where $P$ is the joint cumulative distribution function of $(X,Y)$. It is important to state explicitly that the expectation $\mathbb{E}$ is taken over the joint distribution of $X$ and $Y$. This expression is theoretical: it gives the mean loss under the assumption that $p(x,y)$ is known. The integral ranges over all possible values of $X$ and $Y$, with the loss at each point weighted by the probability of $(x,y)$. As you can see, computing the expected loss requires knowing $P(x,y)$ (or equivalently $p(x,y)$).
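To make this concrete, here is a minimal sketch computing the expected loss exactly for a hypothetical toy case: a discrete joint distribution $p(x,y)$ over $\{0,1\}\times\{0,1\}$ (the table `p`, the hypothesis `h`, and the 0-1 loss are all illustrative assumptions, not from the lecture notes):

```python
# Hypothetical discrete joint distribution p(x, y) over {0,1} x {0,1}.
p = {
    (0, 0): 0.4,  # p(X=0, Y=0)
    (0, 1): 0.1,  # p(X=0, Y=1)
    (1, 0): 0.2,  # p(X=1, Y=0)
    (1, 1): 0.3,  # p(X=1, Y=1)
}

def h(x):
    """An illustrative hypothesis: predict the label equal to the input."""
    return x

def loss(y_pred, y):
    """0-1 loss."""
    return 0.0 if y_pred == y else 1.0

# Expected loss: sum over ALL (x, y) pairs, each loss weighted by p(x, y).
# Here the hypothesis errs on (0,1) and (1,0), so L = 0.1 + 0.2 = 0.3.
L = sum(p[(x, y)] * loss(h(x), y) for (x, y) in p)
print(L)  # approximately 0.3 (up to float rounding)
```

With a continuous distribution the sum becomes the integral above, but the principle is the same: every possible $(x,y)$ contributes, weighted by its probability.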
In practice, your starting point is the data: $\left\{\left(X_{1}, Y_{1}\right), \ldots,\left(X_{n}, Y_{n}\right)\right\}$, which are realizations of the random variables $X$ and $Y$. You know neither $P(x,y)$ nor $p(x,y)$. What do you do then?
You estimate the expected loss $$\int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x),y) dP(x,y)$$ through:
$$\int_{\mathcal{X} \times \mathcal{Y}} \ell(h(x),y) dP(x,y) \approx \frac{1}{n} \sum_{i=1}^{n} \ell\left(h\left(X_{i}\right), Y_{i}\right)$$
The right-hand side, $\frac{1}{n}\sum_{i=1}^{n} \ell\left(h\left(X_{i}\right), Y_{i}\right)$, is the empirical loss. It is computed directly from the data, without any knowledge of $p(x,y)$. Strictly speaking, if you view your data as random variables, then the empirical loss is itself a random variable. It is an estimator of the expected loss.
To go a bit further: your dream is to minimize the expected loss, but since you do not know $p(x,y)$, you are doomed to minimize the empirical risk instead.
$\square$