machine learning algorithm for e-mail classification

Ham Vs Spam email algorithm

I'm reading the following book http://alex.smola.org/drafts/thebook.pdf and on page 22 (shown in the image) it discusses a simple test that classifies an e-mail as ham or spam.

I'm unsure exactly what $x$ and $y$ refer to when they say

"In the example of the AIDS test we used the outcomes of the test to infer whether the patient is diseased. In the context of spam filtering the actual text of the e-mail $x$ corresponds to the test and the label $y$ is equivalent to the diagnosis. Recall Bayes Rule (1.15)."

What I'm confused about is whether $x$ is assigned 'ham' or 'spam' based on whether the e-mail actually is spam or not, and $y$ is assigned 'true' (the test diagnoses it as spam) or 'false' as the diagnosis of the test, or whether it's the other way around: $y$ is assigned 'ham' or 'spam' and $x$ is 'true' or 'false'.

In the example they refer to, this was clear: $X$ was the random variable that took the value AIDS or no AIDS, and $T$ was the random variable that represented the outcome of the test, positive or negative.

I'd be grateful if someone could help clear this issue up in this instance and in addition clarify exactly what they mean by $\mathbb{P}(x|y).$

There is 1 best solution below

$y \in \{\text{spam}, \text{ham}\}$ is the variable that denotes the true (unobserved, to be guessed) "state". $x$ is the feature vector (the observation; here, the mail content).

What we want to compute (in any classification problem) is $p(y \mid x)$: the probability that the mail is or is not spam, given the observation.
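
To connect this with the $\mathbb{P}(x|y)$ asked about in the question: it is the probability of observing the mail content $x$ given that the true label of the mail is $y$. Bayes' rule turns it into the quantity we want. The form below is the standard statement of the rule written in this answer's notation, not a copy of the book's equation (1.15); $p(y)$ is the prior probability of the label.

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}, \qquad p(x) = \sum_{y' \in \{\text{spam},\, \text{ham}\}} p(x \mid y')\, p(y').$$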

> What I'm confused about is whether $x$ is assigned 'ham' or 'spam' based on whether the e-mail actually is spam or not, and $y$ is assigned 'true' (the test diagnoses it as spam) or 'false' as the diagnosis of the test, or whether it's the other way around: $y$ is assigned 'ham' or 'spam' and $x$ is 'true' or 'false'.

I don't quite get why you say that a variable "is assigned" this or that value. The probabilities are just that, probabilities: they are computed before any classification (guess, labelling) is made. The value of $y$ depends on whether the mail is actually spam or not. The value of $x$ depends on the features (some measurement, perhaps multidimensional, of the observed mail).
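
To make the roles of $x$ and $y$ concrete, here is a minimal sketch in Python. It assumes a deliberately crude feature: $x$ is a single boolean saying whether the mail contains some suspicious word. All the numbers and the names `posterior`, `prior` and `likelihood` are hypothetical, chosen only for illustration; they are not taken from the book.

```python
def posterior(prior, likelihood, x):
    """Return p(y | x) for every label y, using Bayes' rule."""
    # Unnormalised scores p(x | y) * p(y), one per label y.
    joint = {y: likelihood[y][x] * prior[y] for y in prior}
    # Evidence p(x): the sum of those scores over all labels.
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical prior p(y): 30% of all mail is spam.
prior = {"spam": 0.3, "ham": 0.7}

# Hypothetical likelihood p(x | y), where x is True if the
# mail contains the suspicious word and False otherwise.
likelihood = {
    "spam": {True: 0.8, False: 0.2},
    "ham": {True: 0.1, False: 0.9},
}

# We observe x = True and compute the probability of each label.
post = posterior(prior, likelihood, x=True)

# Only now do we classify: pick the most probable label.
label = max(post, key=post.get)
```

With these made-up numbers, `post["spam"]` is $0.24 / 0.31$, roughly $0.77$, so the guessed label is spam. Note that $x$ is what we observe and $y$ is what we guess; the probabilities are computed first, and the label is chosen only in the last step.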