In a supervised machine learning setup, one usually considers an underlying probability space $(\Omega, \mathcal{F}, \Bbb P)$ and random vectors/variables $X:\Omega \rightarrow \Bbb R^n, Y: \Omega \rightarrow \Bbb R.$ We can then consider the probability distribution of $(X,Y),$ denoted by $\Bbb P_{X,Y}.$ For a loss function $\ell: \Bbb R \times \Bbb R \rightarrow \Bbb R,$ the corresponding risk of a measurable function $f: \Bbb R^n \rightarrow \Bbb R$ is then defined as
$$R(f): = E_{\Bbb P_{X,Y}}\left[\ell(f(X), Y)\right],$$ where $E_{\Bbb P_{X,Y}}$ denotes the expectation with respect to the probability measure $\Bbb P_{X,Y}.$ The Bayes risk is defined as
$$R^* := \inf \{R(f) \mid f: \Bbb R^n \rightarrow \Bbb R \textrm{ measurable}\}$$ and any measurable $f^*$ for which $R(f^*) = R^*$ is called a target function.
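To fix ideas, here is a toy Monte Carlo sketch of these definitions (the joint distribution, the noise level, and the choice of squared-error loss are all made-up assumptions for illustration, and the sample average only estimates the true risk):

```python
import random

random.seed(0)

# Made-up toy model: X ~ Uniform(0, 1), Y = 2X + Gaussian noise (sd 0.1).
def sample():
    x = random.random()
    y = 2 * x + random.gauss(0.0, 0.1)
    return x, y

def empirical_risk(f, n=100_000):
    """Monte Carlo estimate of R(f) = E[(f(X) - Y)^2] (squared-error loss)."""
    return sum((f(x) - y) ** 2 for x, y in (sample() for _ in range(n))) / n

# In this model E[Y | X = x] = 2x, so f(x) = 2x attains the Bayes risk,
# which equals the noise variance 0.01.
r_star = empirical_risk(lambda x: 2 * x)   # close to 0.01
r_other = empirical_risk(lambda x: x)      # close to 1/3 + 0.01
```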
In many textbooks and courses on the topic, one can find the following statements:
a) if $(X,Y)$ is absolutely continuous and $\ell(y, \hat{y}): = (y - \hat{y})^2,$ then $$f^*(x) = E_{\Bbb P}[Y \mid X = x].$$ b) if $Y$ is discrete, say $Y \in\{1, \ldots, K\}$ with probability one, and $\ell(y, \hat{y}):= \left\{ \begin{array}{ll} 0, & \textrm{if }y = \hat{y} \\ 1, & \textrm{otherwise, } \\ \end{array}\right.$ then
$$f^*(x) = \textrm{argmax}\{\Bbb P(Y = k\mid X = x) \mid k \in \{1, \ldots, K\}\}.$$
I have the following questions:
In a), I am aware of a correct definition of $f^*$ coming from the (measure-theoretic) concept of conditional expectation. Specifically, using the Radon–Nikodym theorem (and some additional assumptions), one can show that there is a measurable $f^*$ (unique a.s.) for which the risk $R$ is minimised, and then by definition $E_{\Bbb P}[Y \mid X] := f^* \circ X,$ i.e. $E_{\Bbb P}[Y \mid X = x] := f^*(x).$ However, in all of these books/courses there is no mention of this proper definition, nor do they give a satisfactory alternative as a definition. How is it possible to work with these constructions in a (computer science) class then? Is there some kind of unspoken truth among computer scientists that I am not aware of? Am I alone in this feeling? This makes me believe that they have a way of looking at these constructions that I am just not familiar with. How should I look at these things then? Reading the computer science literature on the topic is a pain, as I just can't trust what I am reading.
In b), $f^*$ is simply not well defined if $X$ is absolutely continuous (in this case, the event $\{X = x\}$ has probability zero, and thus the conditional probability is not defined). Again, nobody asks these kinds of questions in the lectures. How should I look at it?
Can you please provide a (simple) reference treating these topics in a rigorous fashion? My background is in optimization, so I am not very familiar with the prob/stats literature.
It sounds like you know that the function $f^*$ is the one that minimizes the (square of the) $L^2$-norm $$ \mathbb E[(f^*(X)-Y)^2] = \inf\Big\{\mathbb E[(f(X)-Y)^2]\,\Big|\,f:\mathbb R^n\to\mathbb R\text{ measurable }\Big\}\,. $$ This is one of the two standard definitions of the conditional expectation $\mathbb E[Y|X=x]\,.$ The link to the measure-theoretic concept of conditional expectation is provided by the Doob–Dynkin lemma: $$ f^*(x)=\mathbb E[Y|X=x]\,,\quad\text{ in other words, }\quad f^*(X)=\mathbb E[Y|X]\,. $$ Thoughts about missing mathematical rigor in, say, the engineering literature could fill an entire book. But conversely: how about the missing intuition in overly abstract math literature? My experience is that concepts take time to sink in, whether they come from applied or pure maths. Not trusting what you are reading isn't the worst attitude. Keep it!
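To make this concrete, here is a small finite check that the conditional mean attains the infimum (the joint pmf is made up, and a brute-force search over a grid of candidate functions stands in for the infimum over all measurable $f$):

```python
# Made-up joint pmf on a finite grid: X in {0, 1}, Y in {0.0, 1.0}.
pmf = {
    (0, 0.0): 0.1, (0, 1.0): 0.4,   # so E[Y | X=0] = 0.8
    (1, 0.0): 0.3, (1, 1.0): 0.2,   # so E[Y | X=1] = 0.4
}

def risk(f):
    """Squared-error risk E[(f(X) - Y)^2] under the joint pmf; f is a dict."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in pmf.items())

def cond_mean(x):
    """E[Y | X = x] computed from the joint pmf."""
    px = sum(p for (xx, _), p in pmf.items() if xx == x)
    return sum(p * y for (xx, y), p in pmf.items() if xx == x) / px

f_star = {x: cond_mean(x) for x in (0, 1)}

# Brute-force the infimum over functions f: {0, 1} -> {0, 0.05, ..., 1}.
grid = [i / 20 for i in range(21)]
best = min(risk({0: a, 1: b}) for a in grid for b in grid)
assert risk(f_star) <= best + 1e-12   # the conditional mean attains the minimum
```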
If $Y$ is discrete and $X$ continuous, the conditional probability in (b) can be understood as a well-defined special case of $\mathbb E[Y|X=x]$ above: $$ \mathbb P[Y=k|X=x]=\mathbb E\Big[1_{\{Y=k\}}\Big|X=x\Big]\,. $$ It is more likely, however, that $Y$ and $X$ are both discrete. Then the conditional probability is defined as usual: $$ \mathbb P[Y=k|X=x]=\frac{\mathbb P[Y=k,X=x]}{\mathbb P[X=x]}\,. $$ What I find more interesting is that $$\tag{1} f^*(x)=\text{arg}\max\limits_{k}\Big\{\mathbb P[Y=k|X=x]\Big\} $$ is the function that minimizes (if I got your definition right) $$ \mathbb E\Big[1_{\{f(X)\not=Y\}}\Big]=\mathbb P[f(X)\not=Y]\,. $$ This is equivalent to maximizing $\mathbb P[f(X)=Y]\,.$ From this point of view, (1) is intuitively clear. I recommend considering two rvs $X,Y$ with finitely many values and checking whether that intuition is correct. This should be straightforward and will hopefully make further reading superfluous.
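Such a check can be done in a few lines (the joint pmf below is made up; with finitely many values of $X$ and $Y$ one can simply brute-force over all classifiers):

```python
import itertools

# Made-up joint pmf: X in {0, 1, 2}, class labels Y in {1, 2}.
pmf = {
    (0, 1): 0.20, (0, 2): 0.05,
    (1, 1): 0.10, (1, 2): 0.25,
    (2, 1): 0.15, (2, 2): 0.25,
}

def error(f):
    """Misclassification probability P[f(X) != Y]; f is a dict."""
    return sum(p for (x, y), p in pmf.items() if f[x] != y)

def bayes(x):
    # argmax_k P[Y=k | X=x]; same argmax as the joint P[Y=k, X=x].
    return max((1, 2), key=lambda k: pmf[(x, k)])

f_star = {x: bayes(x) for x in (0, 1, 2)}

# Brute force over all 2^3 classifiers f: {0, 1, 2} -> {1, 2}.
candidates = (dict(zip((0, 1, 2), v)) for v in itertools.product((1, 2), repeat=3))
assert error(f_star) == min(error(f) for f in candidates)
```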
Since machine learning is a relatively new branch of applied maths, I find it unlikely that there is literature treating these topics at your favourite level of mathematical rigor. In mathematical finance, for example, which became a hype starting in the 1980s, it took years to develop such literature. In my humble opinion, mathematically rigorous literature on applied maths always follows the development rather than being ahead of it.