Bayesian Spam Classification

Question

338 Views Asked by Bumbble Comm At 10 May 2026 - 8:27

Say I have 1000 e-mails in my inbox. I count the following:

Spam 600, Ham 400
Among Spam Mails: 100 from known senders, 90 contain the word 'credit'.
Among Ham Mails: 200 from known senders, 10 contain the word 'credit'.

So there are 300 mails from known senders and 100 mails that contain the word 'credit'.

I want to calculate $P(\text{Spam}\,|\,\text{Known} \cap \text{Credit})$, the probability that a mail is spam given that it comes from a known sender and contains the word 'credit'. By Bayes' theorem,

$$P(S\,|\,K\cap C) = P(S) \frac{P(K\cap C\,|\,S)}{P(K \cap C)}$$

Here $P(S)=6/10$ and, since I assume independence, $P(K\cap C\,|\,S)=P(K\,|\,S)\cdot P(C\,|\,S)$. Since there are 90 spams containing 'credit' and 100 spams from known senders, I have

$$P(K\cap C\,|\,S) = 100/600 \cdot 90/600 = 1/40$$

Now here is where I'm confused:

I assume independence, so I thought $P(K\cap C)=P(K)\cdot P(C)=300/1000\cdot 100/1000=3/100$. However, the law of total probability should be equally valid:

$$P(K\cap C)=P(K\cap C\,|\, S) P(S)+ P(K\cap C\,|\,H)\cdot P(H)$$

and since $K$ and $C$ are independent given the class, I can pull them apart:

$$P(K\cap C)=P(K\,|\, S) P(C\,|\, S) P(S)+P(K\,|\, H)P(C\,|\, H)\cdot P(H)$$

Plugging in the values I get

$$P(K\cap C) = 100/600 \cdot 90/600 \cdot 600/1000 + 200/400 \cdot 10/400 \cdot 400/1000 = 1/50$$

Only when I use 1/50 does the overall answer make sense, i.e. only then do I get $P(S\,|\, K\cap C)=1-P(H\,|\, K\cap C)$. Why?
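
The two candidate denominators can be compared numerically. The following is a minimal sketch using exact fractions and the counts stated above; the variable names and the `posteriors` helper are illustrative and not part of the original post.

```python
from fractions import Fraction as F

# Counts taken from the question.
n_spam, n_ham = 600, 400
total = n_spam + n_ham
p_s, p_h = F(n_spam, total), F(n_ham, total)

# Class-conditional rates for "known sender" (K) and "contains 'credit'" (C).
p_k_s, p_c_s = F(100, n_spam), F(90, n_spam)
p_k_h, p_c_h = F(200, n_ham), F(10, n_ham)

# Independence of K and C *within* each class gives the joint terms.
joint_s = p_k_s * p_c_s * p_s  # P(K, C, S)
joint_h = p_k_h * p_c_h * p_h  # P(K, C, H)

# Denominator 1: law of total probability.
evidence_total = joint_s + joint_h

# Denominator 2: P(K) * P(C), i.e. independence without conditioning on the class.
evidence_marginal = F(300, total) * F(100, total)


def posteriors(evidence):
    """Return (P(S | K, C), P(H | K, C)) for a given denominator."""
    return joint_s / evidence, joint_h / evidence


for name, ev in [("total probability", evidence_total),
                 ("P(K) * P(C)", evidence_marginal)]:
    post_s, post_h = posteriors(ev)
    print(f"{name}: P(K,C)={ev}, P(S|K,C)={post_s}, "
          f"P(H|K,C)={post_h}, sum={post_s + post_h}")
```

With the total-probability denominator $1/50$ the posteriors are $3/4$ and $1/4$ and sum to 1. With $P(K)P(C)=3/100$ they are $1/2$ and $1/6$, which sum to $2/3$.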

**Bumbble Comm** · Answer 1 · 2014-02-02 11:37:32

Why should independence be justified? If your bank account manager is among your acquaintances, then the occurrence of "credit" in mail from known senders may well be above average! More specifically, you assume that $K$ and $C$ are independent three times:

"In general", i.e. $P(K\cap C)=P(K)P(C)$;
in case of spam, i.e. $P(K\cap C\mid S)=P(K\mid S)P(C\mid S)$;
and in case of ham, i.e. $P(K\cap C\mid H)=P(K\mid H)P(C\mid H)$.

You cannot expect to have all three if $K$ and $C$ both are indicators (in the positive or negative) of spam vs. ham, i.e. correlated with $S$ (except for a specific overall probability of $S$, which apparently does not match the observed values).
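
To make the conflict concrete: if $K$ and $C$ are independent within both classes, then the gap between $P(K\cap C)$ and $P(K)P(C)$ equals $P(S)\,P(H)\,\bigl(P(K\mid S)-P(K\mid H)\bigr)\bigl(P(C\mid S)-P(C\mid H)\bigr)$, which follows from the law of total covariance. The sketch below checks this identity against the counts in the question; the variable names are illustrative only.

```python
from fractions import Fraction as F

p = F(600, 1000)                      # P(S); P(H) = 1 - p
k_s, k_h = F(100, 600), F(200, 400)   # P(K|S), P(K|H)
c_s, c_h = F(90, 600), F(10, 400)     # P(C|S), P(C|H)

# Marginals via the law of total probability.
p_k = k_s * p + k_h * (1 - p)
p_c = c_s * p + c_h * (1 - p)

# Joint probability under conditional independence in each class.
p_kc = k_s * c_s * p + k_h * c_h * (1 - p)

# Observed gap versus the value predicted by the identity.
gap = p_kc - p_k * p_c
predicted = p * (1 - p) * (k_s - k_h) * (c_s - c_h)

print(p_k, p_c, p_kc)   # 3/10 1/10 1/50
print(gap, predicted)   # -1/100 -1/100
```

The product vanishes only when $P(S)$ is 0 or 1, or when one of the two features has the same rate in spam and ham. Neither holds for these counts, so the third, unconditional independence assumption cannot hold alongside the other two.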
