Say I have 1000 e-mails in my inbox. I count the following things:
- Spam 600, Ham 400
- Among Spam Mails: 100 from known senders, 90 contain the word 'credit'.
- Among Ham Mails: 200 from known senders, 10 contain the word 'credit'.
So there are 300 mails from known senders and 100 mails that contain the word 'credit'.
I want to calculate P(Spam|Known & Credit), the probability that a mail is spam given that it comes from a known sender and contains the word 'credit'. By Bayes' theorem:
$$P(S\,|\,K\cap C) = P(S) \frac{P(K\cap C\,|\,S)}{P(K \cap C)}$$
$P(S)=6/10$ and, since I assume independence, $P(K\cap C\,|\,S)=P(K\,|\,S)\cdot P(C\,|\,S)$. Since there are 90 spams containing 'credit' and 100 spams from known senders, I have
$$P(K\cap C\,|\,S) = 100/600 \cdot 90/600 = 1/40$$
Now here is where I'm confused:
I assume independence, so I thought $P(K\cap C)=P(K)\cdot P(C)=300/1000\cdot 100/1000=3/100$. However, the law of total probability should be equally valid:
$$P(K\cap C)=P(K\cap C\,|\, S) P(S)+ P(K\cap C\,|\,H)\cdot P(H)$$
and since $K$ and $C$ are assumed independent, I can factor each term:
$$P(K\cap C)=P(K\,|\, S) P(C\,|\, S) P(S)+P(K\,|\, H)P(C\,|\, H)\cdot P(H)$$
Plugging in the values I get
$$P(K\cap C) = 100/600 \cdot 90/600 \cdot 600/1000 + 200/400 \cdot 10/400 \cdot 400/1000 = 1/50$$
Only when I use $1/50$ does the overall answer make sense, i.e. I get $P(S\,|\, K\cap C)=1-P(H\,|\, K\cap C)$. Why?
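The two candidate denominators can be compared numerically. Below is a minimal Python sketch using exact fractions and the counts from the question; the variable names and the `posteriors` helper are illustrative and not part of the original post.

```python
from fractions import Fraction as F

# Counts taken from the question
n_spam, n_ham = 600, 400
total = n_spam + n_ham

p_s, p_h = F(n_spam, total), F(n_ham, total)   # P(S), P(H)
p_k_s, p_c_s = F(100, n_spam), F(90, n_spam)   # P(K|S), P(C|S)
p_k_h, p_c_h = F(200, n_ham), F(10, n_ham)     # P(K|H), P(C|H)

# Likelihoods under independence of K and C *within* each class
lik_s = p_k_s * p_c_s   # P(K∩C | S)
lik_h = p_k_h * p_c_h   # P(K∩C | H)


def posteriors(evidence):
    """Return (P(S|K∩C), P(H|K∩C)) for a given denominator P(K∩C)."""
    return p_s * lik_s / evidence, p_h * lik_h / evidence


# Denominator 1: law of total probability
ev_total = p_s * lik_s + p_h * lik_h

# Denominator 2: product of the overall marginals P(K) * P(C)
p_k = p_s * p_k_s + p_h * p_k_h
p_c = p_s * p_c_s + p_h * p_c_h
ev_marginal = p_k * p_c

for name, ev in [("total probability", ev_total), ("P(K)*P(C)", ev_marginal)]:
    post_s, post_h = posteriors(ev)
    print(f"{name}: P(K∩C)={ev}, P(S|K∩C)={post_s}, "
          f"P(H|K∩C)={post_h}, sum={post_s + post_h}")
```

With the denominator $1/50$ the posteriors come out as $3/4$ and $1/4$, which sum to 1. With $3/100$ they come out as $1/2$ and $1/6$, which sum to only $2/3$.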
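Why should independence be justified? If your bank account manager is among your acquaintances, then the occurrence of "credit" in mail from known senders may well be above average! More specifically, you assume that $K$ and $C$ are independent three times:
- given spam: $P(K\cap C\,|\,S)=P(K\,|\,S)\cdot P(C\,|\,S)$,
- given ham: $P(K\cap C\,|\,H)=P(K\,|\,H)\cdot P(C\,|\,H)$,
- unconditionally: $P(K\cap C)=P(K)\cdot P(C)$.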
You cannot expect all three to hold if $K$ and $C$ are both indicators (positive or negative) of spam vs. ham, i.e. correlated with $S$ (except for a specific overall probability of $S$, which apparently does not match the observed values).
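The size of the mismatch can be worked out explicitly. As a sketch, assume only that $K$ and $C$ are independent given $S$ and given $H$. Expanding $P(K)$ and $P(C)$ with the law of total probability and using $P(H)=1-P(S)$ gives
$$P(K\cap C)-P(K)\,P(C) = P(S)\,P(H)\,\bigl[P(K\,|\,S)-P(K\,|\,H)\bigr]\,\bigl[P(C\,|\,S)-P(C\,|\,H)\bigr]$$
With the counts from the question this is
$$\frac{6}{10}\cdot\frac{4}{10}\cdot\left(\frac{1}{6}-\frac{1}{2}\right)\cdot\left(\frac{3}{20}-\frac{1}{40}\right) = -\frac{1}{100} = \frac{1}{50}-\frac{3}{100}$$
which is exactly the gap between the two values of $P(K\cap C)$. The right-hand side vanishes only if $P(S)$ is 0 or 1, or if at least one of $K$, $C$ has the same probability in spam and in ham, so here the unconditional independence assumption is the one that has to be dropped.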