In a supervised machine learning setup, one usually considers an underlying probability space $(\Omega, \mathcal{F}, \Bbb P)$ and random vectors/variables $X:\Omega \rightarrow \Bbb R^n, Y: \Omega \rightarrow \Bbb R.$ We can then consider the probability distribution of $(X,Y),$ denoted by $\Bbb P_{X,Y}.$ For a loss function $\ell: \Bbb R \times \Bbb R \rightarrow \Bbb R,$ the corresponding risk of a measurable function $f: \Bbb R^n \rightarrow \Bbb R$ is then defined as
$$R(f): = E_{\Bbb P_{X,Y}}\left[\ell(f(X), Y)\right],$$ where $E_{\Bbb P_{X,Y}}$ denotes the expectation with respect to the probability measure $\Bbb P_{X,Y}.$ The Bayes risk is defined as
$$R^* := \inf \{R(f) \mid f: \Bbb R^n \rightarrow \Bbb R \textrm{ measurable}\}$$ and any measurable $f^*$ for which $R(f^*) = R^*$ is called a target function.
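To fix ideas, here is a toy Monte Carlo sketch of these definitions (the joint distribution, the noise level, and the choice of squared-error loss are all made-up assumptions for illustration, and the sample average only estimates the true risk):

```python
import random

random.seed(0)

# Made-up toy model: X ~ Uniform(0, 1), Y = 2X + Gaussian noise (sd 0.1).
def sample():
    x = random.random()
    y = 2 * x + random.gauss(0.0, 0.1)
    return x, y

def empirical_risk(f, n=100_000):
    """Monte Carlo estimate of R(f) = E[(f(X) - Y)^2] (squared-error loss)."""
    return sum((f(x) - y) ** 2 for x, y in (sample() for _ in range(n))) / n

# In this model E[Y | X = x] = 2x, so f(x) = 2x attains the Bayes risk,
# which equals the noise variance 0.01.
r_star = empirical_risk(lambda x: 2 * x)   # close to 0.01
r_other = empirical_risk(lambda x: x)      # close to 1/3 + 0.01
```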
In many textbooks and courses on the topic, one can find the following statements:
a) if $(X,Y)$ is absolutely continuous and $\ell(y, \hat{y}): = (y - \hat{y})^2,$ then $$f^*(x) = E_{\Bbb P}[Y \mid X = x].$$ b) if $Y$ is discrete, say $Y \in\{1, \ldots, K\}$ with probability one, and $\ell(y, \hat{y}):= \left\{ \begin{array}{ll} 0, & \textrm{if }y = \hat{y} \\ 1, & \textrm{otherwise, } \\ \end{array}\right.$ then
$$f^*(x) = \textrm{argmax}\{\Bbb P(Y = k\mid X = x) \mid k \in \{1, \ldots, K\}\}.$$
I have the following questions:
In a), I am aware of a correct definition of $f^*$ coming from the (measure-theoretic) concept of conditional expectation. Specifically, using the Radon–Nikodym theorem (and some additional assumptions), one can show that there is a measurable $f^*$ (unique a.s.) for which the risk $R$ is minimised, and then by definition $E_{\Bbb P}[Y \mid X] := f^* \circ X,$ i.e. $E_{\Bbb P}[Y \mid X = x] := f^*(x).$ However, in all of these books/courses there is no mention of this proper definition, nor do they give a satisfactory alternative as a definition. How is it possible to work with these constructions in a (computer science) class then? Is there some kind of unspoken truth among computer scientists that I am not aware of? Am I alone in this feeling? This makes me believe that they have a way of looking at these constructions that I am just not familiar with. How should I look at these things then? Reading the computer science literature on the topic is a pain, as I just can't trust what I am reading.
In b), $f^*$ is simply not well defined if $X$ is absolutely continuous (in this case, the event $\{X = x\}$ has probability zero, and thus the conditional probability is not defined). Again, nobody asks these kinds of questions in the lectures. How should I look at it?
Can you please provide a (simple) reference treating these topics in a rigorous fashion? My background is in optimization, so I am not very familiar with the prob/stats literature.
It sounds like you know that the function $f^*$ is the one that minimizes the (square of the) $L^2$-norm $$ \mathbb E[(f^*(X)-Y)^2] = \inf\Big\{\mathbb E[(f(X)-Y)^2]\,\Big|\,f:\mathbb R^n\to\mathbb R\text{ measurable }\Big\}\,. $$ This is one of the two standard definitions of the conditional expectation $\mathbb E[Y|X=x]\,.$ The link to the measure-theoretic concept of conditional expectation is provided by the Doob–Dynkin lemma: $$ f^*(x)=\mathbb E[Y|X=x]\,,\quad\text{ in other words, }\quad f^*(X)=\mathbb E[Y|X]\,. $$ Thoughts about missing mathematical rigor in, say, the engineering literature could fill an entire book. But conversely: how about the missing intuition in overly abstract math literature? My experience is that concepts take time to sink in, whether they come from applied or pure maths. Not trusting what you are reading isn't the worst attitude. Keep it!
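To make this concrete, here is a small finite check that the conditional mean attains the infimum (the joint pmf is made up, and a brute-force search over a grid of candidate functions stands in for the infimum over all measurable $f$):

```python
# Made-up joint pmf on a finite grid: X in {0, 1}, Y in {0.0, 1.0}.
pmf = {
    (0, 0.0): 0.1, (0, 1.0): 0.4,   # so E[Y | X=0] = 0.8
    (1, 0.0): 0.3, (1, 1.0): 0.2,   # so E[Y | X=1] = 0.4
}

def risk(f):
    """Squared-error risk E[(f(X) - Y)^2] under the joint pmf; f is a dict."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in pmf.items())

def cond_mean(x):
    """E[Y | X = x] computed from the joint pmf."""
    px = sum(p for (xx, _), p in pmf.items() if xx == x)
    return sum(p * y for (xx, y), p in pmf.items() if xx == x) / px

f_star = {x: cond_mean(x) for x in (0, 1)}

# Brute-force the infimum over functions f: {0, 1} -> {0, 0.05, ..., 1}.
grid = [i / 20 for i in range(21)]
best = min(risk({0: a, 1: b}) for a in grid for b in grid)
assert risk(f_star) <= best + 1e-12   # the conditional mean attains the minimum
```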
If $Y$ is discrete and $X$ continuous, the conditional probability in (b) can be understood as a well-defined special case of $\mathbb E[Y|X=x]$ above: $$ \mathbb P[Y=k|X=x]=\mathbb E\Big[1_{\{Y=k\}}\Big|X=x\Big]\,. $$ It is more likely, however, that $Y$ and $X$ are both discrete. Then the conditional probability is defined as usual: $$ \mathbb P[Y=k|X=x]=\frac{\mathbb P[Y=k,X=x]}{\mathbb P[X=x]}\,. $$ What I find more interesting is that $$\tag{1} f^*(x)=\text{arg}\max\limits_{k}\Big\{\mathbb P[Y=k|X=x]\Big\} $$ is the function that minimizes (if I got your definition right) $$ \mathbb E\Big[1_{\{f(X)\not=Y\}}\Big]=\mathbb P[f(X)\not=Y]\,. $$ This is equivalent to maximizing $\mathbb P[f(X)=Y]\,.$ From this point of view, (1) is intuitively clear. I recommend considering two rvs $X,Y$ with finitely many values and checking whether that intuition is correct. This should be straightforward and will hopefully make further reading superfluous.
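Such a check can be done in a few lines (the joint pmf below is made up; with finitely many values of $X$ and $Y$ one can simply brute-force over all classifiers):

```python
import itertools

# Made-up joint pmf: X in {0, 1, 2}, class labels Y in {1, 2}.
pmf = {
    (0, 1): 0.20, (0, 2): 0.05,
    (1, 1): 0.10, (1, 2): 0.25,
    (2, 1): 0.15, (2, 2): 0.25,
}

def error(f):
    """Misclassification probability P[f(X) != Y]; f is a dict."""
    return sum(p for (x, y), p in pmf.items() if f[x] != y)

def bayes(x):
    # argmax_k P[Y=k | X=x]; same argmax as the joint P[Y=k, X=x].
    return max((1, 2), key=lambda k: pmf[(x, k)])

f_star = {x: bayes(x) for x in (0, 1, 2)}

# Brute force over all 2^3 classifiers f: {0, 1, 2} -> {1, 2}.
candidates = (dict(zip((0, 1, 2), v)) for v in itertools.product((1, 2), repeat=3))
assert error(f_star) == min(error(f) for f in candidates)
```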
Since machine learning is a relatively new branch of applied maths, I find it unlikely that there is literature treating these topics at your favourite level of mathematical rigor. In mathematical finance, for example, which became a hype starting in the 1980s, it took years to develop such literature. In my humble opinion, mathematically rigorous literature on applied maths always follows the development rather than being ahead of it.