Every mail is described by a bag of words: $x = (x_1, \ldots, x_l)$, where $x_i \in \{0, 1\}$ indicates whether the $i$th word is present or not. We have $n$ training samples $\{(x^1,y^1),\ldots,(x^n,y^n)\}$, where $y$ indicates whether the mail is relevant, and we want to classify mails accordingly as either relevant or not relevant.
Task 1: Determine the joint distribution, the prior, and the class-conditional distributions $P(x_i|y)$.
Task 2: Consider the class posterior distribution $P(y | x)$ and assume that the cost $c_{1 \to 0}$ of classifying a relevant message as irrelevant is larger than the cost $c_{0 \to 1}$ of classifying an irrelevant message as relevant. The cost of classifying correctly is assumed to be zero. How does the classification rule change?
Edit of my answers --- Task 1:
We can think of this as a Bernoulli trial, where a word $w_i$ is either in the document or it is not. Hence we get
$P(x|y) = \prod_{i=1}^l P(x_i|y) = \prod_{i=1}^l P(w_i|y)^{x_i} \cdot (1-P(w_i|y))^{1-x_i}$
$x_i$ is the binary variable indicating if the word $w_i$ is present or not.
With maximum likelihood, we can estimate $P(y)$ as the fraction of training documents belonging to the corresponding class. The class-conditional distributions can be estimated similarly: for instance, $P(x_i = 1|y)$ is the fraction of class-$y$ documents in which the word $w_i$ is present.
Questions:
Is $\prod_{i=1}^l P(w_i|y)^{x_i} \cdot (1-P(w_i|y))^{1-x_i}$ the joint distribution (it looks like a conditional distribution)?
Any hints for Task 2?
Task 1
I will try to summarize the Naive Bayes classifier and explain it for the email classification problem.
Goal: Classify an email $x=(x_1, \cdots, x_n)$ as relevant ($y=1$) or irrelevant ($y=0$). The goal is therefore to estimate the probability $\mathbb P(y | x)$, which is also called posterior since we evaluate this probability after having seen the email $x$.
Method: To evaluate $\mathbb P(y | x)$ we use the Bayes rule which states that $$\mathbb P(y | x) \mathbb P(x) = \mathbb P(y \cap x) = \mathbb P(x|y)\mathbb P(y)$$
Therefore $$\mathbb P(y | x) = \dfrac{\mathbb P(x|y)\mathbb P(y)}{\mathbb P(x)}$$
However, $\mathbb P(x)$ does not depend on $y$ ($x$ is fixed), so it does not affect the comparison between classes. And actually we are rather interested in $\textrm{argmax}_y\mathbb P(y | x)$ (to choose between the labels $y=0$ and $y=1$), and therefore we are interested in computing
$$\textrm{argmax}_y\mathbb P(y | x) = \textrm{argmax}_y\mathbb P(x|y)\mathbb P(y)$$
The quantity $\mathbb P(y)$ is called the prior. It is the probability that an email is of the class $y$ if you do not have any additional information (namely you do not know the email yet). To estimate this prior we consider all the emails that we have at our disposal: let us denote them $x^{(1)}, \cdots, x^{(m)}$, and we count how many of them correspond to the class $y=0$ and how many correspond to the class $y=1$. These fractions give the estimates of $\mathbb P(y=0)$ and $\mathbb P(y=1)$.
Now we still have to evaluate the probability $\mathbb P(x|y)$, which is called the likelihood. To do that we use a bag-of-words approach: $x=(x_1, \cdots, x_n)$ as you said in your question. $x_i \in \{0,1\}$ denotes the presence of the word $w_i$ in the email. We get
$$\mathbb P(x|y)=\mathbb P(x_1 \cap \cdots \cap x_n|y)$$
At this point we use the naive (and generally incorrect) assumption that the $x_i$ are conditionally independent given $y$ to write
$$\mathbb P(x|y)=\mathbb P(x_1 \cap \cdots \cap x_n|y)=\prod_{i=1}^n \mathbb P(x_i|y)$$
To estimate the probabilities $\mathbb P(x_i|y)$ you only have to count on your training set of emails. For example, to estimate $\mathbb P(x_1 = 1|y=0)$ you take all your emails corresponding to $y=0$ and compute the fraction of these emails that contain the word $w_1$.
Finally, this answers the questions of Task 1, since we have determined the prior and the conditional distributions. The joint distribution $\mathbb P(x \cap y)$ is simply $\mathbb P(x|y)\cdot \mathbb P(y)$, and we have already computed these quantities.
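The counting recipe above can be sketched in code. This is only a minimal illustration under my own naming conventions (nothing here comes from the question itself), and it applies no Laplace smoothing, so a word that never occurs in one class would give a zero probability; a real implementation would smooth the counts:

```python
# Minimal sketch of the counting recipe described above (no smoothing).
# All function and variable names are illustrative, not from the question.
import math

def fit_naive_bayes(X, y):
    """X: list of binary word-presence vectors, y: list of 0/1 labels.
    Returns ML estimates of the prior P(y) and of P(x_i = 1 | y)."""
    m = len(y)
    priors = {c: sum(1 for label in y if label == c) / m for c in (0, 1)}
    cond = {}
    for c in (0, 1):
        docs = [x for x, label in zip(X, y) if label == c]
        cond[c] = [sum(d[i] for d in docs) / len(docs)
                   for i in range(len(docs[0]))]
    return priors, cond

def predict(x, priors, cond):
    """argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    scores = {}
    for c in (0, 1):
        log_lik = sum(math.log(cond[c][i] if x[i] else 1 - cond[c][i])
                      for i in range(len(x)))
        scores[c] = math.log(priors[c]) + log_lik
    return max(scores, key=scores.get)
```

Working in log space in `predict` is the usual trick: the product of many factors smaller than one underflows floating point, while the sum of their logarithms does not.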
Task 2
If both costs are equal, we "pay" the same amount for either kind of classification error. Therefore we classify an email $x$ into the class $0$ if $\mathbb P(y=0 |x) > \mathbb P(y=1 |x)$.
In the case where the costs are not the same, let us denote $c_{1 \to 0}$ the cost of classifying a relevant message as irrelevant and $c_{0 \to 1}$ the cost of classifying an irrelevant message as relevant.
Let us now compute the average cost of our classifier. Let us denote $J(p, y)$ this cost, where $p=\mathbb P(y=1 |x)$ (the posterior probability that the message is relevant). For example, $J(p, y=1)$ is the average price we have to pay if we classify an email $x$ in the class $y=1$ if this email has a posterior probability of $p$.
We find $J(p, y=1)=(1-p)\cdot c_{0 \to 1}$ because with probability $1-p$ the email is in the class $0$ and we classify it as in the class $1$. In the same manner we find that $J(p, y=0)=p\cdot c_{1 \to 0}$.
Now the question is: if we have an email with $p=\mathbb P(y=1 |x)$, should we put it in the class $0$ or in the class $1$? To decide, we minimize the expected cost: we choose the $y$ for which $J(p, y)$ is minimal.
To know that, let us solve $J(p, y=1)<J(p, y=0)$, i.e. $(1-p)\cdot c_{0 \to 1} < p\cdot c_{1 \to 0}$. We get $p \cdot (c_{1 \to 0} + c_{0 \to 1}) >c_{0 \to 1}$, i.e. $$p> \dfrac{c_{0 \to 1}}{c_{1 \to 0} + c_{0 \to 1}}$$
Let us denote $h$ this quantity. We have found that if $p>h$, then $J(p, y=1)<J(p, y=0)$ and then we have to classify the email in the class $y=1$.
Note 1: This means that we no longer look for $\text{argmax}_y \mathbb{P}(y|x)$, because a classification error does not have the same cost for each class.
Note 2: In your question you say that $c_{1 \to 0} > c_{0 \to 1}$. This means that $h < 0.5$, for example $h=0.4$. Consequently you classify emails into the class $1$ more often than into the class $0$. This is natural because misclassifying an email of class $1$ is more expensive than misclassifying an email of class $0$: you are "scared" of making an error with a class-$1$ email, and so you put more emails into the class $1$.
Note 3: Of course if both costs are equal you find that $h=0.5$ as in the first task.
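The threshold rule can be written as a short function. The name `classify` and its arguments are hypothetical; only the formula $h = c_{0 \to 1}/(c_{1 \to 0} + c_{0 \to 1})$ comes from the derivation above:

```python
# Hypothetical helper implementing the threshold rule derived above:
# classify as relevant (y=1) exactly when p = P(y=1|x) > h.
def classify(p, c_1_to_0, c_0_to_1):
    """p: posterior P(y=1|x); c_1_to_0, c_0_to_1: misclassification costs."""
    h = c_0_to_1 / (c_1_to_0 + c_0_to_1)
    return 1 if p > h else 0
```

For instance, with $c_{1 \to 0} = 3$ and $c_{0 \to 1} = 2$ the threshold is $h = 0.4$, so an email with $p = 0.45$ is classified as relevant even though $p < 0.5$; with equal costs the rule reduces to the $p > 0.5$ comparison of Task 1.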