The Bayes optimal predictor is optimal


I'm attending a lecture in Machine Learning Theory, and since I have almost no background in Probability Theory, I'm having problems with the following exercise:

Let $X$ be a set, let $Y = \{0, 1\}$, and let $D$ be a probability distribution on $X \times Y$. The Bayes optimal predictor $f_D: X \to Y$ is defined by $$f_D(x) := \begin{cases} 1 & \text{if } D((x, 1) \mid x) \geq 1/2, \\ 0 & \text{otherwise.} \end{cases}$$

So far so good: $f_D$ just outputs the label that is most probable under the ground-truth distribution, given the input. This assumes, I think, that there is also a marginal distribution $D_X$ on $X$ with $D_X(A) = D(A \times Y)$, in order to make sense of the conditional probability, but the lecture was not very explicit about this.

Now I need to show that this predictor is optimal in the following sense. For a function $g: X \to \{0, 1\}$ (probably measurable?) the classification error of $g$ is defined as $$L_D(g) = D\big(\{(x, y) : g(x) \neq y\}\big),$$ i.e. the probability mass of the pairs that $g$ misclassifies. The claim is that $$L_D(f_D) \leq L_D(g).$$

Since I don't know much about general probability theory, I first looked at the discrete case. There I get $$L_D(g) = \sum_x \sum_{y:\, g(x) \neq y} D(x, y)$$ and, since $f_D(x)$ is by definition a label whose conditional probability is at least $1/2$, $$L_D(f_D) = \sum_x D(x, 1 - f_D(x)) = \sum_x \min\{D(x, 0), D(x, 1)\}$$ (in the tie case $D((x, 0) \mid x) = D((x, 1) \mid x) = 1/2$, the predictor outputs $1$ and errs on $y = 0$, which still contributes the minimum, so a formula summing only over $y$ with $D((x, y) \mid x) < 1/2$ strictly would miss this case).

We can compare the two sums for every $x$ separately and show that the inequality holds for the inner sums. The inner sum for $g$ is just $D(x, 1 - g(x))$, the mass of the single label $y$ with $g(x) \neq y$. Assuming what I wrote in the beginning is correct, we have $D(x, 0) + D(x, 1) = D(\{x\} \times \{0, 1\}) = D_X(x)$, so at most one of the two conditional probabilities $D((x, y) \mid x)$ can exceed $1/2$, and $f_D(x)$ is by definition a label whose conditional probability is at least $1/2$. Hence the contribution of $f_D$ at $x$ is $D(x, 1 - f_D(x)) = \min\{D(x, 0), D(x, 1)\} \leq D(x, 1 - g(x))$, which is the contribution of $g$ at $x$ (with equality when $g(x) = f_D(x)$, or in case of a tie). Summing over $x$ finishes the proof.
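As a sanity check of the discrete argument (not a proof), one can brute-force every classifier $g$ on a small example; here is a minimal sketch where the set $X = \{a, b\}$ and the probability values are made up by me, not taken from the lecture:

```python
from itertools import product

# Made-up discrete distribution D on X x Y with X = {"a", "b"}, Y = {0, 1}.
D = {
    ("a", 0): 0.1, ("a", 1): 0.3,
    ("b", 0): 0.4, ("b", 1): 0.2,
}
X = ["a", "b"]

def f_D(x):
    """Bayes predictor: output 1 iff D((x, 1) | x) >= 1/2."""
    return 1 if D[(x, 1)] / (D[(x, 0)] + D[(x, 1)]) >= 0.5 else 0

def L(g):
    """Classification error: total mass of the pairs (x, y) with g(x) != y."""
    return sum(p for (x, y), p in D.items() if g(x) != y)

# Brute-force every function g: X -> {0, 1} and take the smallest error.
best = min(L(lambda x, t=t: t[X.index(x)])
           for t in product([0, 1], repeat=len(X)))

print(L(f_D) <= best + 1e-12)  # True: f_D attains the minimum error
```

On this example $f_D$ errs on $(a, 0)$ and $(b, 1)$, so $L_D(f_D) = 0.1 + 0.2 = 0.3$, which matches the brute-force minimum, as the per-$x$ comparison above predicts.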

Question 1: Is this proof correct?

Question 2: How do I turn this into a proof for the general case? I probably have to learn how conditional probabilities are defined in that setting; maybe someone can point me to the most important facts and definitions that will help?