Naive Bayes problem applied to text

Assume that you are using a Naive Bayes classifier to classify documents into two classes, Sports and Health. Assume that only $5$ words are used in your model, and denote these 5 features as $w_1, w_2, w_3, w_4$ and $w_5$.

$$p(w_1 |Sports )=0.3$$ $$p(w_2 |Sports )=0.2$$ $$p(w_3 |Sports )=0.05$$ $$p(w_4 |Sports )=0.4$$ $$p(w_5 |Sports )=0.05$$

$$p(w_1 |Health)=0.05$$ $$p(w_2 |Health )=0.3$$ $$p(w_3 |Health )=0.5$$ $$p(w_4 |Health )=0.1$$ $$p(w_5 |Health )=0.05$$

$$p(Sports )= \frac{\text{number of Sports documents}}{\text{total number of documents}} = 0.65$$

$$p(Health )= \frac{\text{number of Health documents}}{\text{total number of documents}} = 0.35$$

Compute $p(Sports|w_1,w_2 )$ and $p(Health|w_1,w_2 )$.

Show the derivation of your answer step by step.

Based on the computed probabilities, which category (Health vs. Sports) do you think this document belongs to?

Can anyone help me understand how to solve this problem?

Best answer

The Naive Bayes assumption, that the features are conditionally independent given the class, leads to

$$P(w_1, w_2, \ldots, w_5 | y) = \prod_{i=1}^{5}P(w_i|y)$$

where $y \in \{\text{Sports}, \text{Health}\}$. Applying Bayes' rule to the two observed words gives

$$P(\text{Sports}|w_1, w_2) = \frac{P(w_1, w_2 | \text{Sports})P(\text{Sports})}{P(w_1, w_2)} = \frac{P(w_1 | \text{Sports})P(w_2 | \text{Sports})P(\text{Sports})}{P(w_1, w_2)}$$

$$P(\text{Health}|w_1, w_2) = \frac{P(w_1, w_2 | \text{Health})P(\text{Health})}{P(w_1, w_2)} = \frac{P(w_1 | \text{Health})P(w_2 | \text{Health})P(\text{Health})}{P(w_1, w_2)}$$

where,

$$\begin{align} P(w_1, w_2) &= P(w_1, w_2 | \text{Sports})P(\text{Sports}) + P(w_1, w_2 | \text{Health})P(\text{Health}) \\ &= P(w_1 | \text{Sports})P(w_2 | \text{Sports})P(\text{Sports}) + P(w_1 | \text{Health})P(w_2 | \text{Health})P(\text{Health})\end{align}$$

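From here, substitute the given likelihoods and priors to obtain the two posteriors, then assign the document to the class with the larger one.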
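As a quick numerical check, the substitution can be sketched in Python. The dictionary and variable names are illustrative only; the numbers are the ones given in the question.

```python
# Likelihoods and priors copied from the question (only w1 and w2 are needed).
likelihood = {
    "Sports": {"w1": 0.3, "w2": 0.2},
    "Health": {"w1": 0.05, "w2": 0.3},
}
prior = {"Sports": 0.65, "Health": 0.35}

# Numerator of Bayes' rule for each class: P(w1|y) * P(w2|y) * P(y).
joint = {y: likelihood[y]["w1"] * likelihood[y]["w2"] * prior[y] for y in prior}

# Denominator P(w1, w2): the sum of the numerators over both classes.
evidence = sum(joint.values())

# Posterior P(y | w1, w2) for each class.
posterior = {y: joint[y] / evidence for y in prior}

for y, p in posterior.items():
    print(f"P({y} | w1, w2) = {p:.4f}")
# P(Sports | w1, w2) = 0.8814
# P(Health | w1, w2) = 0.1186
```

The numerators are $0.039$ for Sports and $0.00525$ for Health, so $P(w_1, w_2) = 0.04425$ and the document would be classified as Sports.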

Edit: Note that, when all five words are observed,

$$P(y|w_1, w_2, \ldots, w_5) = \frac{P(y)\prod_{i=1}^{5}P(w_i|y)}{P(w_1, w_2, \ldots, w_5)}$$

where $y \in \{\text{Sports}, \text{Health}\}$.
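The general formula can be sketched as a small helper that takes any set of observed words. The function name `naive_bayes_posterior` and the table layout are hypothetical. It also assumes, as in the derivation above, that only the observed words contribute factors and that words not listed are simply ignored.

```python
from math import prod

# Full likelihood tables and priors from the question.
LIKELIHOOD = {
    "Sports": {"w1": 0.3, "w2": 0.2, "w3": 0.05, "w4": 0.4, "w5": 0.05},
    "Health": {"w1": 0.05, "w2": 0.3, "w3": 0.5, "w4": 0.1, "w5": 0.05},
}
PRIOR = {"Sports": 0.65, "Health": 0.35}


def naive_bayes_posterior(observed, likelihood=LIKELIHOOD, prior=PRIOR):
    """Return P(y | observed words) for every class y.

    Assumption: only the words in `observed` enter the product.
    """
    # P(y) * prod_i P(w_i | y) for each class y.
    joint = {
        y: prior[y] * prod(likelihood[y][w] for w in observed)
        for y in prior
    }
    # Normalizing constant P(observed words), summed over the classes.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


two_words = naive_bayes_posterior(["w1", "w2"])
all_words = naive_bayes_posterior(["w1", "w2", "w3", "w4", "w5"])
print(two_words)
print(all_words)
```

Calling it with `["w1", "w2"]` reproduces the two-word case asked about in the question, and passing all five words evaluates the formula in the edit.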