How does this optimal classifier make sense in case of continuous random variable?


I'm reading about the Bayes problem in the textbook *A Probabilistic Theory of Pattern Recognition* by Devroye et al.


They make use of $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ throughout the proof.


In my understanding, the conditional probability $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ is defined only when $\mathbb{P}\{X=x\} > 0$. If $X$ is continuous (for example, if $X$ follows a normal distribution), then $\mathbb{P}\{X=x\}=0$ for all $x \in \mathbb{R}$, so $\eta(x)$ would be undefined for every $x \in \mathbb{R}$, which confuses me.

Could you please elaborate on this point?


Best answer (14 votes)

Some comments:

  1. You can get intuition by assuming the setup is that $(X,Y)$ is a process where $Y$ is sampled from a distribution that depends on the realization of $X$. For instance, maybe $X \sim \mathrm{Unif}([0,1])$, and $Y$ is a flip of a coin with bias $X$ (otherwise independent of everything else). Conditioned on $X = 1/2$, $Y$ is a fair coin. This is pretty close to the learning-theory context anyway -- there are some features, $X$, and the class $Y$ is some random function of the features.

    This situation is also essentially general, in a way that is made precise in 3. So, there's really no harm in imagining that this is the story with the data you are trying to learn a classifier for. (Since $Y$ is a binary random variable, you can skip to 5.)

  2. If $(X,Y)$ has a continuous pdf $p(x,y)$, then you can define $p_x(y) = \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y)\, dy }$ as the pdf of $Y$ conditioned on $X = x$. You need the integral in the denominator to be nonzero, but this is a weaker condition than $P(X = x) > 0$. In this specific case, $Y$ is a binary variable, so we'd have $p_x(y) = \frac{ p(x,y)}{p(x,0) + p(x,1)}$. See Wikipedia for more, though I'll now discuss some of the formalism.

  3. You can define a notion of conditional probability for measure-zero sets, called disintegration of measure. It's not really necessary for learning theory, and since building it in general is pretty technical, I wouldn't worry about it unless it interests you (if it does, then the survey by Chang and Pollard cited on Wikipedia is worth reading, as is Chapter 5 in Pollard's "User's Guide"). One important comment, though: you have to build up all of the conditional distributions at once; they are defined a.e. as a family with respect to the distribution of $X$. Otherwise, you run into problems like this: https://en.wikipedia.org/wiki/Borel%E2%80%93Kolmogorov_paradox

    You can verify that $p_x(y)$ as defined above actually gives a disintegration. I'm not sure what conditions are necessary for this to hold, other than that $p_x(y)$ is well defined, and all the integrals you write down in that verification make sense. In particular, I don't think that $p(x,y)$ needs to be a continuous pdf, but would want to find a reference to double check.

    Here's a sketch of the verification; for the notation $\mu_x, \nu$ see Wikipedia. (Note that there is a notation clash -- what they call $Y$ is here called $X \times Y$.) The pushforward measure is $d\nu(x) = (\int_{\mathbb{R}} p(x,y)\, dy)\, dx$, and $d\mu_x(y) = p_x(y)\, dy$ on the fiber $\{x\} \times \mathbb{R}$. When you plug this into the formula from Wikipedia, $\int_X (\int_{\pi^{-1}(x)} f(x,y)\, d\mu_x(y))\, d\nu(x)$, you get:

$$\int_{\mathbb{R}} \int_{\mathbb{R}} f(x,y) \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y) dy } dy (\int_{\mathbb{R}} p(x,y) dy) dx = \int_{\mathbb{R}^2} f(x,y) p(x,y) dxdy.$$

  4. From the learning-theory point of view, I think it makes sense to imagine fixing a disintegration and treating that as the notion of conditional probability for $Y$. Even though it is only defined a.e. in $X$, you are not classifying some arbitrary $x$, but one produced from the distribution. Thus, you'll never 'see' disagreements between two different fixed choices of disintegration. In particular, you can take an especially nice disintegration given by the formula for $p_x(y)$. Also, this means you can treat your distribution as if it were of the kind described in the first bullet.

  5. If $Y$ is a $\{0,1\}$ random variable, $P(Y = 1) = \mathbb{E}[Y]$, so another way to define $P(Y = 1 \mid X = x) = E[Y \mid X = x]$ is via conditional expectation: the random variable $E[Y \mid X]$ is $\sigma(X)$-measurable, so there is a measurable function $f$ with $E[Y \mid X] = f(X)$, and you can then define $E[Y \mid X = x] = f(x)$. Note that, like a disintegration, this is only defined up to almost-sure equivalence, since $E[Y \mid X]$ is only unique up to almost-sure equivalence. However, you can pick nice representatives. For instance, if $Y$ is a coin flip with bias $p$, independent of $X$, then $E[Y \mid X] = p$, so we can take $E[Y \mid X = x] = p$.
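The coin-with-bias-$X$ story from the first item can be simulated directly. Here is a minimal sketch (the sample size and window width are arbitrary choices for illustration): it recovers $\eta(x) = E[Y \mid X = x] = x$ by averaging $Y$ over a small window around $x$, which is how one sidesteps conditioning on the measure-zero event $\{X = x\}$ in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X ~ Unif([0, 1]); given X, Y is a coin flip with bias X.
x = rng.uniform(0.0, 1.0, size=n)
y = (rng.uniform(0.0, 1.0, size=n) < x).astype(int)

# We cannot condition on the measure-zero event {X = x0}, but we can
# average Y over a small window around x0; as the window shrinks, this
# recovers the disintegration, which here is eta(x) = x.
def eta_hat(x0, width=0.02):
    mask = np.abs(x - x0) < width
    return y[mask].mean()

for x0 in (0.25, 0.5, 0.75):
    print(x0, round(eta_hat(x0), 2))
```

Shrinking `width` trades bias (variation of $\eta$ over the window) against variance (fewer samples in the window), which is exactly the bandwidth trade-off in nonparametric estimation.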

Answer (2 votes)

I'm not sure I understand your question, so please let me know if I haven't answered it. I believe you have a misunderstanding about $\eta$: it is the probability that $Y=1$ given the value of $X$, so it is in general not $0$, even in the example you gave.

Building on your example: let $Y$ be Bernoulli with parameter $p$ and independent of $X$; then $\eta(x) = p$, not $0$.
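A quick numerical check of this claim, with a hypothetical bias $p = 0.3$: since $Y$ is independent of $X$, averaging $Y$ over any window of $X$ values returns roughly $p$, no matter where the window sits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
p = 0.3  # hypothetical bias, chosen for this illustration

x = rng.normal(size=n)                     # X ~ N(0, 1), continuous
y = (rng.uniform(size=n) < p).astype(int)  # Y ~ Bernoulli(p), independent of X

# Because Y is independent of X, the windowed average of Y is close to p
# at every window location: eta(x) = p for (almost) every x.
def eta_hat(x0, width=0.1):
    mask = np.abs(x - x0) < width
    return y[mask].mean()

for x0 in (-1.0, 0.0, 1.0):
    print(x0, round(eta_hat(x0), 2))
```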

That is a great book by the way. Lots of interesting problems in there.

Answer (1 vote)

I think it's a great question. Here is one answer, or at least a partial answer. Suppose that $f$ is a joint PDF/PMF for $X$ and $Y$ (a density in $x$ and a mass function in $y$), so that $$f(x, y) \Delta x \approx P(X \in [x, x+\Delta x] \text{ and } Y = y).$$ Then the expression $P(Y = 1 \mid X = x)$ can be defined to mean $\frac{f(x, 1)}{f(x,0) + f(x,1)}$. Why is this a reasonable definition? Intuitively, because if $\Delta x$ is a small positive number, then $P(Y = 1 \mid X = x)$ should be approximately equal to \begin{align} P(Y = 1 \mid X \in [x,x+ \Delta x]) &= \frac{P(Y = 1, X \in [x,x+ \Delta x])}{P(X \in [x,x+ \Delta x])} \\ &\approx \frac{f(x,1) \Delta x}{f(x,0) \Delta x + f(x,1) \Delta x} \\ &= \frac{f(x,1)}{f(x,0) + f(x,1)}. \end{align} I'm not fully satisfied with this explanation, though.
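The limiting argument above can be checked numerically. Below is a minimal sketch under a hypothetical model chosen for concreteness: $Y \sim \mathrm{Bernoulli}(1/2)$ and $X \mid Y = y \sim N(y, 1)$, so $f(x, y) = \tfrac{1}{2}\varphi(x - y)$ with $\varphi$ the standard normal density. The ratio $\frac{f(x,1)}{f(x,0)+f(x,1)}$ is compared against a Monte Carlo estimate of $P(Y = 1 \mid X \in [x, x+\Delta x))$.

```python
import numpy as np

def phi(t):
    # standard normal density
    return np.exp(-t * t / 2.0) / np.sqrt(2.0 * np.pi)

# Hypothetical model: Y ~ Bernoulli(1/2), X | Y = y ~ N(y, 1),
# so f(x, y) = 0.5 * phi(x - y) and the proposed definition gives:
def eta(x0):
    f0, f1 = 0.5 * phi(x0), 0.5 * phi(x0 - 1.0)
    return f1 / (f0 + f1)

# Monte Carlo estimate of P(Y = 1 | X in [x0, x0 + dx)).
rng = np.random.default_rng(2)
n = 400_000
y = (rng.uniform(size=n) < 0.5).astype(int)
x = rng.normal(size=n) + y  # the mean shifts by 1 when y = 1

def eta_hat(x0, dx=0.05):
    mask = (x >= x0) & (x < x0 + dx)
    return y[mask].mean()

for x0 in (-0.5, 0.5, 1.5):
    print(x0, round(eta(x0), 3), round(eta_hat(x0), 3))
```

The two columns agree to within Monte Carlo noise, and the discrepancy shrinks as `dx` decreases (with `n` grown to compensate), which is the content of the limiting argument.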