machine learning algorithm for e-mail classification

Question

136 Views Asked by Bumbble Comm At 04 Apr 2026 - 8:38

I'm reading the following book, http://alex.smola.org/drafts/thebook.pdf, and on page 22 (what's shown in the image) it talks about a simple test which classifies an e-mail as ham or spam.

I'm unsure exactly what $x$ and $y$ refer to when they say

"In the example of the AIDS test we used the outcomes of the test to infer whether the patient is diseased. In the context of spam filtering the actual text of the e-mail $x$ corresponds to the test and the label $y$ is equivalent to the diagnosis. Recall Bayes Rule (1.15)."

What I'm confused about is whether $x$ is assigned 'ham' or 'spam' based on whether the e-mail is spam or not, and $y$ is assigned 'true' (the test diagnoses it as spam) or 'false' as the diagnosis of the test, or whether it's the other way around and $y$ is assigned 'ham' or 'spam' and $x$ is 'true' or 'false'.

In the example they refer to, this was clear: $X$ was the random variable that took the value AIDS or no AIDS, and $T$ was the random variable that represented the outcome of the test, positive or negative.

I'd be grateful if someone could help clear this issue up in this instance and in addition clarify exactly what they mean by $\mathbb{P}(x|y).$

**Bumbble Comm** · Answer 1 · 2015-11-09 21:35:44

$y \in \{\text{spam}, \text{ham}\}$ is the variable that denotes the true (not observed, to be guessed) "state". $x$ is the feature vector (the observation; the mail content in this case).

What we want to compute (in any classification problem) is $p(y \mid x)$ : probability that the mail is (or not) spam, given the observation.
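For reference, here is Bayes' rule written in this notation. I'm assuming it matches what the book labels (1.15); I haven't copied the book's exact form. It shows how the $\mathbb{P}(x|y)$ from the question relates to the quantity we actually want:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}, \qquad p(x) = \sum_{y' \in \{\text{spam},\, \text{ham}\}} p(x \mid y')\, p(y')$$

Here $p(x \mid y)$ is the probability of observing the mail content $x$ among mails whose true label is $y$ (for instance, how likely this particular text is among spam mails), and $p(y)$ is the prior probability of the label before looking at the mail.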

> What I'm confused about is whether $x$ is assigned 'ham' or 'spam' based on whether the e-mail is spam or not, and $y$ is assigned 'true' (the test diagnoses it as spam) or 'false' as the diagnosis of the test, or whether it's the other way around and $y$ is assigned 'ham' or 'spam' and $x$ is 'true' or 'false'.

I don't quite get why you say that a variable "is assigned" this or that value. Computing the probabilities is only that: computing probabilities. It comes before the classification (the guess, the labelling). The value of $y$ depends on whether the mail is actually spam or not. The value of $x$ depends on the features (some measurement, perhaps multidimensional, of the observed mail).
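To make the roles of $x$ and $y$ concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the prior, the per-word likelihoods, and the names `PRIOR`, `WORD_LIKELIHOOD`, `likelihood` and `posterior` are made up and do not come from the book. It also assumes that words are independent given the label (the "naive Bayes" simplification), which may not be the model used on that page.

```python
from math import prod

# Hypothetical prior p(y): made-up numbers, not taken from the book.
PRIOR = {"spam": 0.4, "ham": 0.6}

# Hypothetical per-word likelihoods p(word | y), also made up.
WORD_LIKELIHOOD = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.05, "meeting": 0.20},
}


def likelihood(words, label):
    """p(x | y): probability of observing these words in a mail whose
    true label is `label`, assuming words are independent given the label."""
    return prod(WORD_LIKELIHOOD[label][w] for w in words)


def posterior(words):
    """p(y | x) for each label y, computed with Bayes' rule."""
    joint = {y: likelihood(words, y) * PRIOR[y] for y in PRIOR}  # p(x | y) p(y)
    evidence = sum(joint.values())  # p(x), summed over both labels
    return {y: joint[y] / evidence for y in joint}


# x is the observed mail content; y is the unknown label we reason about.
post = posterior(["free"])
print(post)
```

With these toy numbers, a mail containing only the word "free" gets a posterior of about 0.8 for spam and 0.2 for ham. Note that nothing is "assigned" until you decide on a rule, for example picking the label with the larger posterior.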
