I'm looking at the following proof:
Where:
Note: I'm new to this, but I think I now understand the variables below correctly. Mistakes are still possible, though.
- $f$ is an ideal function with perfect results with respect to some sample set
- $\sum_{h}$ denotes a sum over the hypothesis space
- $\sum_{x \in \mathcal{X}-X}$ denotes a sum over the entire sample space excluding the training data
- $P(x)$ is the probability distribution over the sample space
- The binary function in the middle is an indicator function that equals $1$ whenever the selected hypothesis disagrees with the ideal function
- $P(h | X, L_a)$ is the probability of selecting hypothesis $h$ given that algorithm/learned model $L_a$ is used on training data $X$
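Putting these definitions together, the expression in question (my reconstruction from the definitions above, since the original rendering is not shown here) is:

$$E_{ote}(L_a \mid X, f) = \sum_{h} \sum_{x \in \mathcal{X}-X} P(x)\,\mathbb{I}\big(h(x) \neq f(x)\big)\,P(h \mid X, L_a)$$

where $\mathbb{I}(\cdot)$ is the indicator ("binary") function described above.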
The author says that for a target function $\mathcal{X} \to \{0,1\}$, its function space would be $\{0,1\}^{|\mathcal{X}|}$. He furthermore says
if $f$ is uniformly distributed, half of the predictions of $f$ on $X$ are not consistent with $h(X)$
I do not understand what is meant by this nor how he arrived at that conclusion. Why would it be the case that if $f$ is uniformly distributed that half of its predictions are not consistent with $h(X)$?
After showing this proof, he goes on to say that the test error, $E_{ote}$, is independent of the chosen algorithm ($L_a$).

Thanks to @Joe for suggesting the resources to answer this. The below two pictures from "Learning from Data - A Short Course" (which I would suggest) are what made this problem clear to me.
Deriving the 1/2
In the picture, $g$ is the selected hypothesis. Let's say that $g$ matches $f_1$. Now we calculate how much disagreement there is between $g$ and all the other $f$s.
The average disagreement is then $(0+1+1+2+1+2+2+3)/8=1.5$. There are 3 points outside the training set, so in this case our hypothesis disagrees on average with exactly $1.5/3=1/2$ of the other possible hypotheses. Now to my original question: why does it matter that $f$ is uniformly distributed? Notice that the only reason the above works is that every $f$ completely disagrees with every other $f$ outside the training data. If there were agreement between some of the $f$s, the math would no longer hold up.
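To make this concrete, here is a small Python sketch (choosing $g$ as the all-zeros labeling is my own illustration; any choice gives the same numbers) that enumerates all $2^3 = 8$ possible $f$s on the 3 off-training-set points and reproduces the $1/2$:

```python
from itertools import product

# 3 points lie outside the training set, so each candidate target
# function f is one of the 2^3 = 8 possible bit-strings on them.
off_training_fs = list(product([0, 1], repeat=3))

# Let the selected hypothesis g agree with one particular f (here all zeros).
g = (0, 0, 0)

# Hamming disagreement between g and every possible f.
disagreements = [sum(gi != fi for gi, fi in zip(g, f)) for f in off_training_fs]

avg = sum(disagreements) / len(off_training_fs)  # (0+1+1+2+1+2+2+3)/8 = 1.5
fraction = avg / 3                               # 1.5 / 3 = 1/2
print(avg, fraction)                             # prints 1.5 0.5
```

By symmetry, swapping $g$ for any other bit-string permutes the disagreement counts but leaves the average, and hence the $1/2$, unchanged.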
Deriving the $2^n$
This is really less of a derivation and more just requires some logical thought about the nature of binary. Look at the chart of the various $f$s mapped out. Notice again, assuming that all the $f$s completely disagree with each other, there are exactly $2^n$ functions which agree with the training data, where $n$ is the number of points outside the training set. Think about it logically: if all $f$s must disagree and you only have 3 bits, then there are only $2^3 = 8$ possibilities. It does not matter what the training data is; the points outside the training data have only 8 possibilities before they would start to overlap. This scales to any number of points: $n$ could be a billion and this would still be true.
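As a sanity check, here is a Python sketch (the sample-space size of 5 and the particular training labels are my own arbitrary choices) that enumerates every $f: \mathcal{X} \to \{0,1\}$ and counts how many agree with the training data:

```python
from itertools import product

# A sample space of 5 points, the first 2 of which are training points.
# Every target function f: X -> {0,1} is a 5-bit string, 2^5 = 32 in total.
all_fs = list(product([0, 1], repeat=5))

train_labels = (1, 0)  # labels of the 2 training points (arbitrary choice)

# Count the functions that agree with the training data on those 2 points.
consistent = [f for f in all_fs if f[:2] == train_labels]
print(len(consistent))  # prints 8, i.e. 2^(5-2): one f per bit-string
                        # on the 3 off-training-set points
```

Whatever training labels you pick, the count comes out to $2^n$ with $n$ the number of off-training-set points, which is the scaling claimed above.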