Proving probabilistic classifier is optimal


In my studies of probability within machine learning, I have come across the following setting:

We have a domain $ \mathcal{X} \times \{0,1\} $ and two random variables $ X $ taking values in $ \mathcal{X} $ and $ Y $ taking values in $ \{0,1\} $, with a joint probability distribution $ \mathcal{D} $ defined on $ \mathcal{X} \times \{0,1\} $. We look at label functions of the form $ f : \mathcal{X} \to \{0,1\} $, and we seek a label function $ f $ for which the misclassification probability $ \mathcal{D}(f(X) \neq Y) $ is minimal. My textbook says that the minimizer is the Bayes classifier, which is:

$ f(x) = \begin{cases} 1 & \text{if } \mathcal{D}(Y=1 \mid X=x) \geq \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} $
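To make the rule concrete, here is a minimal sketch in Python on a small made-up discrete joint distribution (the numbers and the names `D`, `posterior_y1`, `bayes_classifier` are my own illustration, not from the textbook):

```python
# Hypothetical joint distribution D(X=x, Y=y) over X in {0, 1, 2}, Y in {0, 1}.
D = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.05,
    (2, 0): 0.10, (2, 1): 0.20,
}

def posterior_y1(x):
    """D(Y=1 | X=x) = D(X=x, Y=1) / D(X=x)."""
    p_x = D[(x, 0)] + D[(x, 1)]          # marginal D(X=x)
    return D[(x, 1)] / p_x

def bayes_classifier(x):
    """Predict 1 iff D(Y=1 | X=x) >= 1/2, as in the rule above."""
    return 1 if posterior_y1(x) >= 0.5 else 0

print([bayes_classifier(x) for x in (0, 1, 2)])  # [1, 0, 1]
```

For instance, $ \mathcal{D}(Y=1 \mid X=0) = 0.30 / 0.40 = 0.75 \geq \frac{1}{2} $, so the rule outputs $1$ at $x=0$.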

I understand this intuitively, but can someone please provide a rigorous proof of this? I thank all helpers.


Best answer:

Since $D(Y\neq f(X))=1-D(Y= f(X))$, we can minimize the former by maximizing the latter (I am simplifying notation here a bit, feel free to let me know if it is unclear). Assuming $\mathcal{X}$ is countable, we have that

\begin{align*} D(Y= f(X))&=\sum_{x\in \mathcal{X}}\sum_{y\in\{0,1\}}\mathbf{1}[y=f(x)]\,D(X=x,Y=y)\\ &=\sum_{x\in \mathcal{X}} D(X=x)\sum_{y\in\{0,1\}}\mathbf{1}[y=f(x)]\,D(Y=y\mid X=x)\\ &=\sum_{x\in \mathcal{X}} D(X=x)\, D(Y=f(x)\mid X=x) \end{align*}

We thus have that \begin{align*} \sup_{f}D(Y= f(X))&=\sup_{f}\sum_{x\in \mathcal{X}} D(X=x)\, D(Y=f(x)\mid X=x)\\ &=\sum_{x\in \mathcal{X}} D(X=x)\, \sup_{y\in\{0,1\}}D(Y=y\mid X=x), \end{align*} where the supremum can be moved inside the sum because the value $f(x)$ can be chosen independently for each $x$, and the weights $D(X=x)$ are nonnegative.

In other words, we are looking for the function $f$ that maximizes $D(Y=f(x)\mid X=x)$ for each $x$ separately. That function is precisely

$$f(x)=\operatorname{argmax}_{y\in\{0,1\}}D(Y=y\mid X=x),$$

where by argmax I mean that you maximize the function but then return the argument (i.e. the $y$) that gave you the maximum instead of the maximum itself. Since $D(Y=0\mid X=x)+D(Y=1\mid X=x)=1$, choosing $y=1$ is optimal exactly when $D(Y=1\mid X=x)\geq\frac{1}{2}$ (with the tie at $\frac{1}{2}$ broken in favor of $1$), which is exactly the Bayes classifier that was given.
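The argument can also be sanity-checked by brute force on a small discrete example: enumerate all $2^{|\mathcal{X}|}$ label functions and confirm that the Bayes rule attains the maximum of $D(Y=f(X))$. The toy distribution and helper names below are my own illustration:

```python
# Brute-force check on a hypothetical toy distribution over X in {0, 1, 2}, Y in {0, 1}.
from itertools import product

D = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.05,
    (2, 0): 0.10, (2, 1): 0.20,
}
xs = (0, 1, 2)

def accuracy(f):
    """D(Y = f(X)) = sum_x D(X=x, Y=f(x)); f is a tuple (f(0), f(1), f(2))."""
    return sum(D[(x, f[x])] for x in xs)

# Enumerate every label function f : {0, 1, 2} -> {0, 1} and take the best one.
best = max(product((0, 1), repeat=3), key=accuracy)

# Bayes rule: f(x) = 1 iff D(Y=1|X=x) >= 1/2, equivalently D(X=x,Y=1) >= D(X=x,Y=0).
bayes = tuple(1 if D[(x, 1)] >= D[(x, 0)] else 0 for x in xs)

print(best, bayes)  # (1, 0, 1) (1, 0, 1)
```

Here the pointwise maxima $0.30 + 0.25 + 0.20 = 0.75$ are attained by the Bayes rule, and no other labeling does better.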