I'm reading 机器学习 (literally, *Machine Learning*). I'm at the part where the author illustrates how to calculate what he calls the out-of-training error ($E_{ote}$) - roughly, test error in English. He provides the following equation, which I am trying to understand:

For some algorithm $\mathcal{L}_a$, which is a function of some set of training data $X$, with a "perfect" function $f$ (perfect here meaning, if I've understood up to this point, that it works for all cases), the error is:

$$E_{ote}(\mathcal{L}_a \mid X, f) = \sum_{h} \sum_{x \in \mathcal{X} - X} P(x)\, \mathbb{I}\big(h(x) \ne f(x)\big)\, P(h \mid X, \mathcal{L}_a)$$
- $\sum_{h}$ - I don't understand this part. He defines $H$ as the hypothesis space (but as far as I can tell it is unused), and $h$ as some potential hypothesis resulting from running algorithm $\mathcal{L}_a$ on training data $X$. What I don't understand is what is being summed. Is it all the possible hypotheses in the hypothesis space?
- $\sum_{x \in \mathcal{X}-X}$, where $\mathcal{X}$ is the hypothesis space and $X$ is the training data (as mentioned above). Is this saying: for each element in the set of the training data minus the hypothesis space? I don't understand how you could subtract training data from a hypothesis space.
Last clarification: he says the notation

$$\mathbb{I}(\cdot)$$

means that if the argument is true the value is 1, otherwise it is 0. If I understand the rest, the expression means the probability of the occurrence of $x$ (I'm hazy on this, since I don't understand the sum) multiplied by the above-mentioned boolean expression, multiplied by the probability of $h$ occurring given some training data $X$ and algorithm $\mathcal{L}_a$.


$\sum_h$ is a sum over all hypotheses in the hypothesis space. I assume this would have to be an integral if your hypothesis space were not discrete.
$\sum_{x \in \mathcal{X} \setminus X}$ is saying that you sum over all of the points in the data space $\mathcal{X}$ that were NOT in your training set $X$. That's odd to me: I'm used to the generalization error being defined on points randomly drawn from the data space using the data distribution $P(x)$, NOT intentionally excluding the training set. (If this comes from a No Free Lunch discussion, the restriction may be deliberate: off-training-set error is the quantity NFL-style results are traditionally stated over.)
$\mathbb{I}(h(x)\ne f(x))$ is an indicator function that is one when the hypothesis predicts an incorrect label.
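Concretely, the indicator is just a true/false-to-1/0 conversion. A minimal sketch (the names `indicator`, `h`, and `f` are mine, purely for illustration):

```python
def indicator(condition: bool) -> int:
    """I(condition): 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0

# Toy hypothesis and target function, both mapping ints to {0, 1}:
h = lambda x: x % 2                # an illustrative hypothesis
f = lambda x: 1 if x >= 3 else 0   # an illustrative "perfect" function

print(indicator(h(2) != f(2)))  # h(2)=0, f(2)=0 -> they agree -> prints 0
print(indicator(h(4) != f(4)))  # h(4)=0, f(4)=1 -> they disagree -> prints 1
```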
$P(h | X, \mathcal{L}_a)$ is the probability of the learning algorithm selecting hypothesis $h$ as the learned hypothesis, given training set $X$.
All of those parts put together give you an equation measuring the expected classification error of the learned hypothesis on points drawn from the data space (excluding the training set), where the expectation is over the randomness of the data drawn, with distribution $P(x)$, AND over the randomness of the hypothesis learned by the algorithm, which has distribution $P(h \mid X, \mathcal{L}_a)$.
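To make the double sum concrete, here is a toy computation on a four-point data space with binary labels. Everything here (the target `f`, the learner that picks uniformly among hypotheses consistent with `f` on the training set, the uniform `P(x)` over off-training points) is my own assumption for illustration, not the book's setup:

```python
from itertools import product

data_space = [0, 1, 2, 3]                               # 𝒳
train = [0, 1]                                          # X, the training set
off_train = [x for x in data_space if x not in train]   # 𝒳 \ X

f = {0: 0, 1: 1, 2: 1, 3: 0}  # an assumed "perfect" target function

# Hypothesis space: every possible binary labelling of the data space.
hypotheses = [dict(zip(data_space, labels))
              for labels in product([0, 1], repeat=len(data_space))]

# Assumed learner: pick uniformly among hypotheses consistent with f on X,
# so P(h | X, L_a) is uniform over that consistent set and 0 elsewhere.
consistent = [h for h in hypotheses if all(h[x] == f[x] for x in train)]
p_h = 1 / len(consistent)

# Assumed P(x): uniform over the off-training points.
p_x = 1 / len(off_train)

# E_ote = sum_h sum_{x in 𝒳\X} P(x) * I(h(x) != f(x)) * P(h | X, L_a)
e_ote = sum(p_x * (h[x] != f[x]) * p_h
            for h in consistent
            for x in off_train)
print(e_ote)  # 0.5: every consistent h is free on 2 and 3, half are wrong each
```

The 0.5 here is the No-Free-Lunch flavour of the result: a learner that only memorizes the training set does no better than chance on the points it never saw.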