Trouble understanding conditional probability and the Bayes optimal predictor in machine learning


I've just finished my second year of a maths degree and am interested in machine learning. I'm reading Understanding Machine Learning: From Theory to Algorithms by Shalev-Shwartz and Ben-David, and I got pretty confused trying to do Exercise 3.7. I want to stress that I understand everything intuitively; my trouble is figuring out how to formalize what's being said.

We have a probability distribution $D$ over $\mathcal{X}\times\{0,1\}$, and the book defines the Bayes predictor $f_D$ by \begin{equation} f_D(x)=\begin{cases}1 & \text{if } \mathbb{P}[y=1\mid x]\geq 1/2,\\ 0 & \text{otherwise.}\end{cases}\end{equation} I'm asked to prove that for any other classifier $h:\mathcal{X}\longrightarrow\{0,1\}$ it holds that $L_D(f_D)\leq L_D(h)$, where the loss (or risk) of a classifier is defined as $L_D(h)=\mathbb{P}[h(x)\neq y]=D(\{(x,y):h(x)\neq y\})$.
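To make the claim concrete for myself, I wrote a finite toy example (the distribution below is my own invention, not from the book) where every marginal probability is positive, so the elementary definition of conditional probability applies, and where the Bayes predictor can be compared against every possible classifier by brute force:

```python
import itertools

# Toy distribution D over X x {0,1} with X = {0, 1, 2} (made-up numbers).
# D[(x, y)] = probability of the pair (x, y); entries sum to 1.
D = {
    (0, 0): 0.10, (0, 1): 0.30,   # P[y=1 | x=0] = 0.30 / 0.40 = 0.75
    (1, 0): 0.25, (1, 1): 0.05,   # P[y=1 | x=1] = 0.05 / 0.30 ~ 0.17
    (2, 0): 0.12, (2, 1): 0.18,   # P[y=1 | x=2] = 0.18 / 0.30 = 0.60
}
X = [0, 1, 2]

def risk(h):
    """0-1 risk L_D(h) = P[h(x) != y], with h given as a dict x -> label."""
    return sum(p for (x, y), p in D.items() if h[x] != y)

# Bayes predictor: predict 1 iff P[y=1 | x] >= 1/2, which (since the two
# labels' masses sum to the marginal of x) is iff D[(x,1)] >= D[(x,0)].
f = {x: int(D[(x, 1)] >= D[(x, 0)]) for x in X}

# Brute force over all 2^3 classifiers h : X -> {0, 1}.
risks = [risk(dict(zip(X, labels)))
         for labels in itertools.product([0, 1], repeat=len(X))]
print(risk(f), min(risks))  # the Bayes risk equals the minimum, here 0.27
```

On a finite space like this, the inequality $L_D(f_D)\leq L_D(h)$ can literally be checked against all classifiers, which is what convinced me the statement itself is fine and my problem is purely with the continuous case.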

After some time, I realized I don't even fully understand how conditional probability is being used here, or what it's supposed to mean.

When I was introduced to conditional probability, we had a probability function $\mathbb{P}$ defined on subsets of $\Omega$, and given $A,B\subseteq\Omega$ we defined $\mathbb{P}(A\mid B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}$. But for this we need $\mathbb{P}(B)\neq 0$. When I apply that definition here, I get $\mathbb{P}[y=1\mid x]=\frac{\mathbb{P}[y=1 \text{ and } x]}{\mathbb{P}[x]}=\frac{D(\{(x,1)\})}{D(\{(x,0),(x,1)\})}$, and this makes zero sense to me, because in typical circumstances, when $\mathcal{X}$ has a continuum of possible values, the probability of a single point (or of two points) is $0$.
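Just to be sure I have the elementary definition right, here is how I understand it on a finite sample space (my own example of two fair coin flips), where $\mathbb{P}(B)>0$ and the ratio formula is unambiguous:

```python
from fractions import Fraction

# Finite sample space: two fair coin flips (my own example).
Omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
P = {w: Fraction(1, 4) for w in Omega}

A = {w for w in Omega if w[1] == "H"}   # event: second flip is heads
B = {w for w in Omega if w[0] == "H"}   # event: first flip is heads

# P(A | B) = P(A and B) / P(B), well-defined because P(B) = 1/2 > 0.
p_B = sum(P[w] for w in B)
p_AB = sum(P[w] for w in A & B)
print(p_AB / p_B)  # 1/2
```

This definition is exactly what breaks down for me when $\mathbb{P}[x]=0$ for every individual $x$.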

I think I'm missing something. Is there a second probability distribution $D_x:\{0,1\}\longrightarrow\mathbb{R}$ for each $x\in \mathcal{X}$? How does it relate to $D$? I suspect there may be more probability distributions involved than I thought, because in places like Wikipedia's page on the Bayes classifier they use identities I've never seen, such as $\mathbb{E}_{XY}[G(X, Y)]=\mathbb{E}_X\mathbb{E}_{Y\mid X}[G(X, Y)]$ for some function $G$. This suggests there must also be yet another probability distribution, a marginal one over $\mathcal{X}$.
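In the discrete case I can at least verify that identity numerically: the joint $D$ splits into a marginal over $\mathcal{X}$ and a conditional $D_x$ over $\{0,1\}$ for each $x$, and the two ways of computing the expectation agree (again with made-up numbers of my own):

```python
# Discrete check of E_{XY}[G(X,Y)] = E_X[ E_{Y|X}[G(X,Y)] ] (made-up numbers).
D = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.05,
    (2, 0): 0.12, (2, 1): 0.18,
}
G = lambda x, y: (x + 1) * y  # an arbitrary test function

# Left side: expectation directly under the joint distribution D.
lhs = sum(p * G(x, y) for (x, y), p in D.items())

# Right side: first the marginal of X, then for each x the conditional
# distribution D_x(y) = D[(x, y)] / marginal[x] over {0, 1}.
marginal = {x: D[(x, 0)] + D[(x, 1)] for x in [0, 1, 2]}
rhs = sum(
    marginal[x] * sum((D[(x, y)] / marginal[x]) * G(x, y) for y in [0, 1])
    for x in [0, 1, 2]
)
print(abs(lhs - rhs) < 1e-12)  # True
```

So in the finite setting the "tower" identity is just regrouping a sum; my confusion is entirely about what $D_x$ and the marginal mean when single points have probability $0$.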

While writing this I found Wikipedia's pages on conditional probability and conditional expectation. I haven't been taught any of this; should I look into it to better understand what I'm doing?