Bayes classifier -> precise math?


I want to understand the simplest form of the Bayes classifier (see here), but I want to understand it in a really precise, clean, mathematical way.

Math description of the setting:

Let us assume that $(\Omega, \mathcal{A}, P)$ is a probability space. We endow $\mathbb{R}$ with the usual Borel sigma algebra $\mathcal{B}$ and assume that $(E, \Sigma)$ is an arbitrary measurable space. For a set $A \in \mathcal{B}$ and a map $X : \Omega \to \mathbb{R}$ we write $[X \in A]$ (or sometimes just $X \in A$ without the brackets) for $\{\omega \in \Omega : X(\omega) \in A\}$. Occasionally we also abbreviate $[X=a] := [X \in \{a\}]$.

Let $X_1, X_2 : \Omega \to \mathbb{R}$ and $Z : \Omega \to E$ be measurable functions. $X_1, X_2$ are called conditionally independent given $Z$ iff for all sets $A, B \in \mathcal{B}$ and all sets $S \in \Sigma$ we have $$P( [X_1 \in A] \cap [X_2 \in B] ~\mathbf{|}~ Z \in S) = P(X_1 \in A ~\mathbf{|}~ Z \in S) \cdot P(X_2 \in B~\mathbf{|}~Z \in S)$$
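To see the definition in action, here is a minimal numeric sketch with a made-up discrete distribution (so sums replace integrals): the joint law of $(X_1, X_2, Z)$ is built to factorize given $Z$, and the defining identity is then verified by enumeration.

```python
import itertools

# Hypothetical toy model: given Z=T, X1 and X2 are i.i.d. Bernoulli(0.8);
# given Z=F, they are i.i.d. Bernoulli(0.2). Z is a fair coin.
p_z = {"T": 0.5, "F": 0.5}
p_x_given_z = {"T": {1: 0.8, 0: 0.2}, "F": {1: 0.2, 0: 0.8}}

# Joint distribution over (x1, x2, z), built so the factorization holds.
joint = {
    (x1, x2, z): p_z[z] * p_x_given_z[z][x1] * p_x_given_z[z][x2]
    for x1, x2, z in itertools.product([0, 1], [0, 1], ["T", "F"])
}

def prob(pred):
    """Probability of the event described by a predicate on (x1, x2, z)."""
    return sum(p for (x1, x2, z), p in joint.items() if pred(x1, x2, z))

# Check P(X1=a, X2=b | Z=z) == P(X1=a | Z=z) * P(X2=b | Z=z) for all a, b, z.
for a, b, z in itertools.product([0, 1], [0, 1], ["T", "F"]):
    pz = prob(lambda x1, x2, zz: zz == z)
    lhs = prob(lambda x1, x2, zz: x1 == a and x2 == b and zz == z) / pz
    rhs = (prob(lambda x1, x2, zz: x1 == a and zz == z) / pz
           * prob(lambda x1, x2, zz: x2 == b and zz == z) / pz)
    assert abs(lhs - rhs) < 1e-12
```

Here conditioning on single outcomes $[Z = z]$ suffices because $\Sigma$ is generated by them.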

We put $E = \{T, F\}$ with $\Sigma = \mathcal{P}(E) = \{\{\}, \{T\}, \{F\}, \{T,F\}\}$ ($T$ stands for 'True' and $F$ for 'False'). We let $Z = r \circ (X_1, X_2)$ arise as the pullback to $\Omega$ of a ($\mathcal{B} \otimes \mathcal{B}$-$\Sigma$-)measurable map $r : \mathbb{R} \times \mathbb{R} \to \{T,F\}$ ($r$ = 'result', i.e. a function classifying, say, emails as spam or not).

**Assumption for the Bayes classifier:** $X_1, X_2$ are conditionally independent given $Z$.

The target of the Bayes classifier is the following: given finitely many values $(a_j, b_j) \in \mathbb{R} \times \mathbb{R}$ and labels $y_j \in \{T, F\}$ such that $r(a_j, b_j) = y_j$ for $j=1,\dots,n$, try to predict for every pair $(a,b) \in \mathbb{R} \times \mathbb{R}$ whether $r(a,b)=T$ or $r(a,b)=F$.

This is how people describe the setting in practice:

For a given $x=(a,b)$, compute $P(T|x)$ and $P(F|x)$ and predict '$T$' iff. the first value is bigger than the second.

I do not know what these symbols are supposed to mean, but I guess they mean the following: the first means something like $P(Z=T|X=(a,b))$, where $X := (X_1, X_2)$ and the events '$Z=T$' and '$X=(a,b)$' are as above (and the second one analogously with $F$).

Q1: Is this correct?

So now that the symbols make sense we want to compute

$$P(T|(a,b)) = P(Z=T|X=(a,b)) = \frac{P([Z=T] \cap [X=(a,b)])}{P(X=(a,b))}$$

But this seems weird, because when $X_1, X_2$ are real random variables, single points usually have probability zero. For example, if they are normally distributed, then $P(X = (a,b)) = P_{X}(\{(a,b)\}) = P_{X_1}(\{a\}) \cdot P_{X_2}(\{b\}) = \int_{a}^a \ldots \cdot \int_{b}^b \ldots = 0 \cdot 0 = 0$ (note that setting $S$ to be all of $E$, we recover classical independence from the conditional independence).

Q2: Why isn't $P(X=(a,b)) = 0$ always?

Now we want to compute this value using Bayes' theorem:

$$P(Z=T|X=(a,b)) = \frac{P(X=(a,b)|Z=T) \cdot P(Z=T)}{P(X=(a,b))}$$ $$ = \frac{P(Z=T)}{P(X=(a,b))} \cdot P(X_1=a \cap X_2=b|Z=T)$$ $$ = \frac{P(Z=T)}{P(X=(a,b))} \cdot P(X_1=a|Z=T) \cdot P(X_2=b|Z=T)$$ by the conditional independence.

Ignoring the denominator and noting that we can effectively estimate $P(Z=T)$ by computing 'nr. of $T$'s divided by $n$', we are left with the computation of the two terms on the right.
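The prior estimate just described is a one-liner; a sketch with made-up labels:

```python
# Estimate the prior P(Z=T) as 'nr. of T's divided by n'
# (the labels below are hypothetical).
labels = ["T", "F", "T", "T", "F", "T"]
n = len(labels)
p_t = sum(1 for y in labels if y == "T") / n
p_f = 1 - p_t  # since E = {T, F}, the two priors sum to 1
```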

In this article the author computes them as follows: given a concrete set of observations $$(a_1, b_1, y_1), ..., (a_n, b_n, y_n)$$ we compute

$$\mu_{1,T} = \text{mean}(\{a_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=T\})$$ $$\mu_{2,T} = \text{mean}(\{b_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=T\})$$ $$\mu_{1,F} = \text{mean}(\{a_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=F\})$$ $$\mu_{2,F} = \text{mean}(\{b_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=F\})$$ $$\sigma_{1,T} = \sqrt{\text{var}(\{a_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=T\})}$$ $$\sigma_{2,T} = \sqrt{\text{var}(\{b_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=T\})}$$ $$\sigma_{1,F} = \sqrt{\text{var}(\{a_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=F\})}$$ $$\sigma_{2,F} = \sqrt{\text{var}(\{b_i | i \in \{1,...,n\} ~\text{is such that}~ y_i=F\})}$$ and put $$ f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2} \right)$$

Then $$P(X_1=a|Z=T) = f(a|\mu_{1,T}, \sigma_{1,T})$$
$$P(X_2=b|Z=T) = f(b|\mu_{2,T}, \sigma_{2,T})$$ and so forth.
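The estimation step above can be sketched in a few lines of Python. The data points are invented, and `class_stats` uses the biased (population) variance, which is one of several conventions the article could be using:

```python
import math

def class_stats(values):
    """Sample mean and (population) standard deviation of a list of values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_density(x, mu, sigma):
    """f(x | mu, sigma): Gaussian density, used in place of P(X=x | Z=y)."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

# Hypothetical training observations (a_i, b_i, y_i).
data = [(1.0, 2.0, "T"), (1.2, 1.8, "T"), (0.9, 2.2, "T"),
        (3.0, 0.5, "F"), (3.2, 0.7, "F"), (2.8, 0.4, "F")]

# params[(feature, class)] = (mu_{feature,class}, sigma_{feature,class})
params = {}
for y in ("T", "F"):
    a_vals = [a for a, b, lab in data if lab == y]
    b_vals = [b for a, b, lab in data if lab == y]
    params[(1, y)] = class_stats(a_vals)
    params[(2, y)] = class_stats(b_vals)
```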

Q3: I guess this is the same issue as above: assuming that $X_1$ is normally distributed 'on' $[Z=T]$ does not mean that single points have positive probability (in particular, why should the probability of a point be a value of the density function? This seems weird to me...)?

Cheers & THX in advance,

FW


There is 1 solution below.

BEST ANSWER

In a naive Bayes classifier, you usually need to compute $P(X | C_i)$; in your context these are $P(X | T)$ and $P(X | F)$. When $X$ is a discrete attribute, both $P(X | T)$ and $P(X | F)$ are easy to estimate from the training data. However, if $X$ is a continuous attribute, there is a technical difficulty: as you've stated, $P(X = x | F) = P(X = x | T) = 0$, which is not what we want. To address this problem, the trick is to assume that $$ P(X = x | T) = f(x | \mu_T, \sigma_T)\\ P(X = x | F) = f(x | \mu_F, \sigma_F), $$ so that $P(X = x | T)$ and $P(X = x | F)$ won't equal $0$. Here $f$ is the same density as in your question. Note that $P$ here is no longer a valid probability. In fact, I don't really understand the theoretical reason for choosing a Gaussian. But intuitively, if a Gaussian is used, then the closer the value of $X$ is to the mean, the more weight is assigned (i.e., the larger $P(X | T)$ is). This should be expected, since in the training data it's very likely that most $X$ values are near the mean, and thus deserve more weight.

EDIT:

In my opinion, $P(X | C_i)$ doesn't have to be a probability; a value representing a weight is enough. When predicting the class label for a new input, we predict the label $T$ if $\frac{P(T | X)}{P(F | X)} > 1$, so in fact only the ratio $\frac{P(X | T)}{P(X | F)}$ (together with the priors) matters.
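Since the shared denominator $P(X=(a,b))$ cancels in this ratio, the prediction can be computed without ever evaluating it. A sketch with hypothetical fitted parameters:

```python
import math

def gaussian_density(x, mu, sigma):
    """f(x | mu, sigma), the Gaussian density from the question."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

def posterior_ratio(a, b, priors, params):
    """P(T | (a,b)) / P(F | (a,b)); the denominator P(X=(a,b)) cancels."""
    num = (priors["T"]
           * gaussian_density(a, *params[(1, "T")])
           * gaussian_density(b, *params[(2, "T")]))
    den = (priors["F"]
           * gaussian_density(a, *params[(1, "F")])
           * gaussian_density(b, *params[(2, "F")]))
    return num / den

# Hypothetical fitted values: params[(feature, class)] = (mu, sigma).
priors = {"T": 0.5, "F": 0.5}
params = {(1, "T"): (1.0, 0.2), (2, "T"): (2.0, 0.3),
          (1, "F"): (3.0, 0.2), (2, "F"): (0.5, 0.3)}

# Predict T iff the ratio exceeds 1.
prediction = "T" if posterior_ratio(1.1, 1.9, priors, params) > 1 else "F"
```

(In practice one compares sums of log-densities instead of products, to avoid floating-point underflow, but the decision is the same.)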