I'm reading about The Bayes Problem in textbook A Probabilistic Theory of Pattern Recognition by Devroye et al.
They make use of $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ throughout the proof.
In my understanding, the conditional probability $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ is defined only when $\mathbb{P}\{X=x\} > 0$. If $X$ is continuous, for example if $X$ follows a normal distribution, then $\mathbb{P}\{X=x\}=0$ for all $x \in \mathbb R$. Then $\eta(x)$ is undefined for all $x \in \mathbb R$, which confuses me.
Could you please elaborate on this point?



Some comments:
You can get intuition from assuming that the setup is that $(X,Y)$ is some process where $Y$ is sampled from a distribution that depends on the realization of $X$. For instance, maybe $X \sim \mathrm{Unif}([0,1])$, and, given $X = x$, $Y$ is a flip of a coin with bias $x$. Conditioned on $X = 1/2$, $Y$ is a fair coin. This is pretty close to the learning theory context anyway -- there are some features, $X$, and the class $Y$ is some random function of the features.
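This story is easy to check numerically. Here's a minimal sketch in Python (the window width and sample size are arbitrary choices of mine): even though $\mathbb{P}\{X = 1/2\} = 0$, conditioning on $X$ landing in a small window around $1/2$ recovers $\eta(1/2) = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1_000_000
x = rng.uniform(0.0, 1.0, size=n)          # X ~ Unif([0,1])
y = (rng.uniform(size=n) < x).astype(int)  # Y | X = x  ~  Bernoulli(x)

# P(X = 1/2) = 0, so condition on a small window around 1/2 instead.
window = np.abs(x - 0.5) < 0.01
print(y[window].mean())  # close to 0.5, i.e. eta(1/2) = 1/2
```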
This situation is also essentially general, in a way that is made precise in the comment on disintegration below. So there's really no harm in imagining that this is the story with the data you are trying to learn a classifier for. (Since $Y$ is a binary random variable, you can skip to the last comment, on conditional expectation.)
If $(X,Y)$ has a joint pdf $p(x,y)$, then you can define $p_x(y) = \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y')\, dy' }$ as the pdf of $Y$ conditioned on $X = x$. You need the integral in the denominator to be nonzero, but this is a weaker condition than $\mathbb{P}\{X = x\} > 0$. In this specific case $Y$ is a binary variable, so $p(x,y)$ is a density in $x$ and a pmf in $y$, and we'd have $p_x(y) = \frac{ p(x,y)}{p(x,0) + p(x,1)}$. See Wikipedia for more, though I'll now discuss some of the formalism.
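As an illustration of the binary formula, here's a sketch with a hypothetical model of my own choosing (Gaussian class-conditionals with equal priors -- not from the book): the ratio $p(x,1)/(p(x,0)+p(x,1))$ collapses to a logistic function, which we can check numerically.

```python
from math import exp, sqrt, pi

# Hypothetical model (not from the book): Y ~ Bernoulli(1/2) and
# X | Y = k ~ N(2k - 1, 1), so p(x, k) = (1/2) * N(x; 2k - 1, 1)
# is a density in x and a pmf in k.
def p(x, k):
    mu = 2 * k - 1                  # mean -1 for class 0, +1 for class 1
    return 0.5 * exp(-(x - mu) ** 2 / 2) / sqrt(2 * pi)

def eta(x):
    # p_x(1) = p(x, 1) / (p(x, 0) + p(x, 1))
    return p(x, 1) / (p(x, 0) + p(x, 1))

# For this model the ratio simplifies to the logistic 1 / (1 + e^{-2x}).
x = 0.7
print(eta(x), 1 / (1 + exp(-2 * x)))  # the two values agree
```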
You can define a notion of conditional probability for measure zero sets, called disintegration of measure. It's really not necessary for learning theory, and since building it in general is pretty technical, I wouldn't worry about it unless it interests you (if it does, then the survey by Chang and Pollard, cited on Wikipedia, is worth reading, as is Chapter 5 in Pollard's "User's Guide"). One important comment, though: you have to build up all of the conditional distributions at once; they are defined only a.e., as a family, with respect to the distribution of $X$. Otherwise, you run into problems like this: https://en.wikipedia.org/wiki/Borel%E2%80%93Kolmogorov_paradox
You can verify that $p_x(y)$ as defined above actually gives a disintegration. I'm not sure exactly what conditions are necessary for this to hold, other than that $p_x(y)$ is well defined and all the integrals you write down in that verification make sense. In particular, I don't think that $p(x,y)$ needs to be a continuous pdf, but I would want to find a reference to double-check.
Here's a sketch of the verification; for the notation $\mu_x, \nu$ see Wikipedia. (Note that there is a notation clash -- what they call $Y$ is here called $X \times Y$.) The pushforward measure is $d \nu(x) = (\int_{\mathbb{R}} p(x,y)\, dy)\, dx$, and $d\mu_x(y) = p_x(y)\, dy$ on the fiber $\{x\} \times \mathbb{R}$. When you plug this into the formula from Wikipedia, $\int_X (\int_{\pi^{-1}(x)} f(x,y)\, d \mu_x(y) )\, d\nu(x)$, you get:
$$\int_{\mathbb{R}} \left( \int_{\mathbb{R}} f(x,y)\, \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y')\, dy' }\, dy \right) \left( \int_{\mathbb{R}} p(x,y')\, dy' \right) dx = \int_{\mathbb{R}^2} f(x,y)\, p(x,y)\, dx\, dy.$$
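Here's a numerical version of that cancellation, as a sketch under a hypothetical joint pdf $p(x,y) = x + y$ on $[0,1]^2$ and an arbitrary test function $f$ (both my own choices); the two sides agree because $p_x(y)$ times the marginal is exactly $p(x,y)$.

```python
import numpy as np

# Hypothetical joint pdf on [0,1]^2 (it integrates to 1): p(x, y) = x + y.
n = 2000
xs = (np.arange(n) + 0.5) / n             # midpoint grid on [0, 1]
X, Y = np.meshgrid(xs, xs, indexing="ij")

p = X + Y                                 # joint density
f = np.cos(X) * Y**2                      # arbitrary test function

marginal = p.mean(axis=1)                 # int p(x,y) dy   (midpoint rule)
p_x = p / marginal[:, None]               # conditional density p_x(y)

# Iterated integral: integrate f d mu_x, then against the pushforward d nu.
lhs = ((f * p_x).mean(axis=1) * marginal).mean()
rhs = (f * p).mean()                      # plain double integral of f * p
print(lhs, rhs)                           # the two sides agree
```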
From the learning theory point of view, I think it makes sense to imagine fixing a disintegration and treating that as the notion of conditional probability for $Y$. Even though it is only defined a.e. in $X$, you are not classifying some arbitrary $x$, but one produced from the distribution; thus you'll never 'see' disagreements between two different fixed choices of disintegration. In particular, you can take the particularly nice disintegration given by the formula $p_x(y)$. Also, this means you can treat your distribution as if it is of the kind described in the first comment.
If $Y$ is a $\{0,1\}$ random variable, $P(Y = 1) = \mathbb{E}[Y]$. Another way to define $P ( Y = 1 | X = x) = E [ Y | X = x]$ is via conditioning: the random variable $E [ Y |X ]$ is $\sigma(X)$-measurable, so there is a measurable function $f$ with $E [ Y |X ] = f(X)$. You can then define $E[Y | X = x] = f(x)$. Note that, like a disintegration, this is only defined up to almost sure equivalence, since $E[Y|X]$ is only unique up to almost sure equivalence. However, you can pick nice representatives. For instance, if $Y$ is a coin flip with bias $p$, independent of $X$, then $E[Y|X] = p$, so we can take $E[ Y|X = x] = p$.
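To see $f(x) = E[Y \mid X = x]$ concretely, here's a sketch with a hypothetical model of my own where $f$ is known in closed form: $Y \mid X \sim \mathrm{Bernoulli}(\sigma(X))$ with $\sigma$ the logistic function, so $f = \sigma$. Averaging $Y$ over samples with $X$ near a point approximates $f$ there (the window width is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: X ~ N(0, 1) and Y | X ~ Bernoulli(sigmoid(X)),
# so f(x) = E[Y | X = x] = 1 / (1 + e^{-x}).
n = 2_000_000
x = rng.standard_normal(n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(float)

# Recover f at a few points by averaging Y over samples with X near x.
for center in (-1.0, 0.0, 1.0):
    mask = np.abs(x - center) < 0.05
    est = y[mask].mean()
    true = 1 / (1 + np.exp(-center))
    print(f"x = {center:+.1f}: estimate {est:.3f}, sigmoid {true:.3f}")
```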