How does this optimal classifier make sense in case of continuous random variable?


I'm reading about the Bayes problem in the textbook *A Probabilistic Theory of Pattern Recognition* by Devroye et al.


They make use of $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ throughout the proof.


In my understanding, the conditional probability $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ is defined only when $\mathbb{P}\{X=x\} > 0$. If $X$ is continuous (for example, if $X$ follows a normal distribution), then $\mathbb{P}\{X=x\}=0$ for all $x \in \mathbb{R}$, so $\eta(x)$ would be undefined for every $x \in \mathbb{R}$, which confuses me.

Could you please elaborate on this point?


Best answer (14 votes)

Some comments:

  1. You can get intuition by assuming the setup is that $(X,Y)$ is a process where $Y$ is sampled from a distribution that depends on the realization of $X$. For instance, maybe $X \sim \mathrm{Unif}([0,1])$, and $Y$ is a flip of a coin with bias $X$ (otherwise independent of everything else). Conditioned on $X = 1/2$, $Y$ is a fair coin. This is pretty close to the learning-theory context anyway -- there are some features, $X$, and the class $Y$ is some random function of the features.

    This situation is also essentially general, in a way that is made precise in 3. So, there's really no harm in imagining that this is the story with the data you are trying to learn a classifier for. (Since $Y$ is a binary random variable, you can skip to 5.)

  2. If $(X,Y)$ has a continuous pdf $p(x,y)$, then you can define $p_x(y) = \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y)\, dy }$ as the pdf of $Y$ conditioned on $X = x$. You need the integral in the denominator to be nonzero, but this is a weaker condition than $P(X = x) > 0$. In this specific case, $Y$ is a binary variable, so we'd have $p_x(y) = \frac{ p(x,y)}{p(x,0) + p(x,1)}$. See Wikipedia for more, though I'll now discuss some of the formalism.

  3. You can define a notion of conditional probability for measure-zero sets, called disintegration of measure. It's not really necessary for learning theory, and since building it in general is pretty technical, I wouldn't worry about it unless it interests you (if it does, then the survey by Chang and Pollard cited on Wikipedia is worth reading, as is Chapter 5 in Pollard's "User's Guide"). One important comment, though: you have to build up all of the conditional distributions at once; they are defined a.e. as a family with respect to the distribution of $X$. Otherwise, you run into problems like this: https://en.wikipedia.org/wiki/Borel%E2%80%93Kolmogorov_paradox

    You can verify that $p_x(y)$ as defined above actually gives a disintegration. I'm not sure what conditions are necessary for this to hold, other than that $p_x(y)$ is well defined, and all the integrals you write down in that verification make sense. In particular, I don't think that $p(x,y)$ needs to be a continuous pdf, but would want to find a reference to double check.

    Here's a sketch of the verification; for the notation $\mu_x, \nu$ see Wikipedia. (Note that there is a notation clash -- what they call $Y$ is here called $X \times Y$.) The pushforward measure is $d\nu(x) = (\int_{\mathbb{R}} p(x,y)\, dy)\, dx$, and $d\mu_x(y) = p_x(y)\, dy$ on the fiber $\{x\} \times \mathbb{R}$. When you plug this into the formula from Wikipedia, $\int_X (\int_{\pi^{-1}(x)} f(x,y)\, d\mu_x(y))\, d\nu(x)$, you get:

$$\int_{\mathbb{R}} \int_{\mathbb{R}} f(x,y) \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y) dy } dy (\int_{\mathbb{R}} p(x,y) dy) dx = \int_{\mathbb{R}^2} f(x,y) p(x,y) dxdy.$$

  4. From the learning-theory point of view, I think it makes sense to imagine fixing a disintegration and treating that as the notion of conditional probability for $Y$. Even though it is only defined a.e. in $X$, you are not classifying some arbitrary $x$, but one produced from the distribution. Thus, you'll never 'see' disagreements between two different fixed choices of disintegration. In particular, you can take an especially nice disintegration given by the formula for $p_x(y)$. Also, this means you can treat your distribution as if it were of the kind described in the first bullet.

  5. If $Y$ is a $\{0,1\}$ random variable, $P(Y = 1) = \mathbb{E}[Y]$, so another way to define $P(Y = 1 \mid X = x) = E[Y \mid X = x]$ is via conditional expectation: the random variable $E[Y \mid X]$ is $\sigma(X)$-measurable, so there is a measurable function $f$ with $E[Y \mid X] = f(X)$, and you can then define $E[Y \mid X = x] = f(x)$. Note that, like a disintegration, this is only defined up to almost-sure equivalence, since $E[Y \mid X]$ is only unique up to almost-sure equivalence. However, you can pick nice representatives. For instance, if $Y$ is a coin flip with bias $p$, independent of $X$, then $E[Y \mid X] = p$, so we can take $E[Y \mid X = x] = p$.
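The coin-with-bias-$X$ story from the first item can be simulated directly. Here is a minimal sketch (the sample size and window width are arbitrary choices for illustration): it recovers $\eta(x) = E[Y \mid X = x] = x$ by averaging $Y$ over a small window around $x$, which is how one sidesteps conditioning on the measure-zero event $\{X = x\}$ in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# X ~ Unif([0, 1]); given X, Y is a coin flip with bias X.
x = rng.uniform(0.0, 1.0, size=n)
y = (rng.uniform(0.0, 1.0, size=n) < x).astype(int)

# We cannot condition on the measure-zero event {X = x0}, but we can
# average Y over a small window around x0; as the window shrinks, this
# recovers the disintegration, which here is eta(x) = x.
def eta_hat(x0, width=0.02):
    mask = np.abs(x - x0) < width
    return y[mask].mean()

for x0 in (0.25, 0.5, 0.75):
    print(x0, round(eta_hat(x0), 2))
```

Shrinking `width` trades bias (variation of $\eta$ over the window) against variance (fewer samples in the window), which is exactly the bandwidth trade-off in nonparametric estimation.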

Answer (2 votes)

I'm not sure I understand your question, so please let me know if I haven't answered it. I believe you have a misunderstanding about $\eta$: it is the probability that $Y=1$ given the value of $X$, so it is in general not $0$, even in the example you gave.

Building on your example: let $Y$ be Bernoulli with parameter $p$ and independent of $X$; then $\eta(x) = p$, not $0$.
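A quick numerical check of this claim, with a hypothetical bias $p = 0.3$: since $Y$ is independent of $X$, averaging $Y$ over any window of $X$ values returns roughly $p$, no matter where the window sits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
p = 0.3  # hypothetical bias, chosen for this illustration

x = rng.normal(size=n)                     # X ~ N(0, 1), continuous
y = (rng.uniform(size=n) < p).astype(int)  # Y ~ Bernoulli(p), independent of X

# Because Y is independent of X, the windowed average of Y is close to p
# at every window location: eta(x) = p for (almost) every x.
def eta_hat(x0, width=0.1):
    mask = np.abs(x - x0) < width
    return y[mask].mean()

for x0 in (-1.0, 0.0, 1.0):
    print(x0, round(eta_hat(x0), 2))
```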

That is a great book by the way. Lots of interesting problems in there.

Answer (1 vote)

I think it's a great question. Here is one answer, or at least a partial answer. Suppose that $f$ is a joint PDF/PMF for $X$ and $Y$ (a density in $x$ and a mass function in $y$), so that $$f(x, y) \Delta x \approx P(X \in [x, x+\Delta x] \text{ and } Y = y).$$ Then the expression $P(Y = 1 \mid X = x)$ can be defined to mean $\frac{f(x, 1)}{f(x,0) + f(x,1)}$. Why is this a reasonable definition? Intuitively, because if $\Delta x$ is a small positive number, then $P(Y = 1 \mid X = x)$ should be approximately equal to \begin{align} P(Y = 1 \mid X \in [x,x+ \Delta x]) &= \frac{P(Y = 1, X \in [x,x+ \Delta x])}{P(X \in [x,x+ \Delta x])} \\ &\approx \frac{f(x,1) \Delta x}{f(x,0) \Delta x + f(x,1) \Delta x} \\ &= \frac{f(x,1)}{f(x,0) + f(x,1)}. \end{align} I'm not fully satisfied with this explanation, though.
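The limiting argument above can be checked numerically. Below is a minimal sketch under a hypothetical model chosen for concreteness: $Y \sim \mathrm{Bernoulli}(1/2)$ and $X \mid Y = y \sim N(y, 1)$, so $f(x, y) = \tfrac{1}{2}\varphi(x - y)$ with $\varphi$ the standard normal density. The ratio $\frac{f(x,1)}{f(x,0)+f(x,1)}$ is compared against a Monte Carlo estimate of $P(Y = 1 \mid X \in [x, x+\Delta x))$.

```python
import numpy as np

def phi(t):
    # standard normal density
    return np.exp(-t * t / 2.0) / np.sqrt(2.0 * np.pi)

# Hypothetical model: Y ~ Bernoulli(1/2), X | Y = y ~ N(y, 1),
# so f(x, y) = 0.5 * phi(x - y) and the proposed definition gives:
def eta(x0):
    f0, f1 = 0.5 * phi(x0), 0.5 * phi(x0 - 1.0)
    return f1 / (f0 + f1)

# Monte Carlo estimate of P(Y = 1 | X in [x0, x0 + dx)).
rng = np.random.default_rng(2)
n = 400_000
y = (rng.uniform(size=n) < 0.5).astype(int)
x = rng.normal(size=n) + y  # the mean shifts by 1 when y = 1

def eta_hat(x0, dx=0.05):
    mask = (x >= x0) & (x < x0 + dx)
    return y[mask].mean()

for x0 in (-0.5, 0.5, 1.5):
    print(x0, round(eta(x0), 3), round(eta_hat(x0), 3))
```

The two columns agree to within Monte Carlo noise, and the discrepancy shrinks as `dx` decreases (with `n` grown to compensate), which is the content of the limiting argument.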