Defining a learning task with measure theory


I'm trying to use the formalism of measure theory to define a learning task. I want to say that a classifier $f: X \rightarrow Y$ tries to approximate the joint probability distribution of $X$ and $Y$.
I don't really know how to define this joint probability distribution, but have the following idea:

Let $(\mathcal{X}, \mathcal{A}, \mu_x)$ and $(\mathcal{Y}, \mathcal{B}, \mu_y)$ be two probability spaces. On $\mathcal{X}$ we define a random variable $X:\mathcal{X} \rightarrow \mathbb{R}$ and on $\mathcal{Y}$ a random variable $Y:\mathcal{Y} \rightarrow \mathbb{R}$. The joint distribution could then be defined as a measure $\mu$ on the product space $\Omega = \mathcal{X} \times \mathcal{Y}$, and my classifier would be a function $f: \mathcal{X} \rightarrow \mathcal{Y}$.

Does that sound correct, or did I completely misunderstand the whole thing?

There are 2 best solutions below


More correct would be to say that $Z=(X,Y)$ is a random variable from the product space $(\Omega,\mu_x\otimes \mu_y)$ to $\mathbb R^2$. And it is not that $f$ is trying to approximate the joint probability distribution, but rather the graph of $f$, $$ \textrm{graph}(f):=\{(x,f(x))\colon x\in \mathcal{X}\}, $$ which tries to approximate the support of $Z$.
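A minimal numerical sketch of this picture, with an entirely hypothetical choice of $f$ (here $\sin$), noise level, and sample size: $Z=(X,Y)$ is sampled so that its law is concentrated near $\textrm{graph}(f)$, and we measure how far samples of $Z$ fall from the graph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target function; f, the noise level, and the sample
# size are illustrative choices, not taken from the answer.
f = np.sin
n = 10_000

# Z = (X, Y): X uniform on [0, 2*pi], Y = f(X) + Gaussian noise,
# so the law of Z is peaked along graph(f), but (for Gaussian noise)
# its support is the whole strip [0, 2*pi] x R.
x = rng.uniform(0.0, 2 * np.pi, size=n)
y = f(x) + rng.normal(scale=0.1, size=n)

# Vertical distance from each sample of Z to graph(f).
dist_to_graph = np.abs(y - f(x))
print(dist_to_graph.mean())  # small: most of the mass sits near the graph
```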


Starting from @pre-kidney's answer:

In practice the support of $Z$ is likely to be everything: the distribution is merely peaked along the graph of $f$, but still has nonzero probability away from the graph.

So rather than trying to approximate the support of $Z$, it seems better to pick a distribution supported on $\textrm{graph}(f)$ and minimize something like the Wasserstein distance to $Z$.

Say you compute the marginal distribution on $X$ that comes from $Z$. Let $W_f$ be the distribution on $\textrm{graph}(f)$ with the same marginal. Then compare $W_f$ to $Z$ and optimize over choices of $f$.
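A sketch of this construction on empirical samples, with a hypothetical $f$ and data. Because $W_f$ shares its $X$-marginal with $Z$, the coupling that pairs each sample $(x_i, y_i)$ of $Z$ with the point $(x_i, f(x_i))$ on the graph has cost $|y_i - f(x_i)|$ under the $L^1$ metric on $X \times Y$, so its average gives an easy upper bound on the 1-Wasserstein distance between the two empirical measures.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical f and samples from Z; everything here is illustrative.
def f(x):
    return 2.0 * x + 1.0

x = rng.uniform(-1.0, 1.0, size=5_000)
y = f(x) + rng.normal(scale=0.2, size=x.shape)  # samples of Z = (X, Y)

# W_f: push the empirical X-marginal of Z onto graph(f), i.e. the
# empirical measure of the points (x_i, f(x_i)).
wf = np.column_stack([x, f(x)])
z = np.column_stack([x, y])

# The coupling that matches points with the same x_i costs
# |y_i - f(x_i)|, so its average cost upper-bounds W_1(W_f, Z)
# for the empirical measures.
w1_upper = np.abs(z[:, 1] - wf[:, 1]).mean()
print(w1_upper)
```

Minimizing this bound over $f$ recovers the familiar regression objective of minimizing the expected absolute error $\mathbb{E}|Y - f(X)|$.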

Wasserstein distance requires a metric on the underlying space $X \times Y$, but often that is something you have. If you don't, you could use KL divergence instead, but I suspect it will be infinite for support reasons and therefore maximally unhelpful: for example, observing both $(1,2)$ and $(1,2.0001)$ would mean that $Z$ cannot come from the graph of any function, but only for silly error reasons.
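The support issue can be seen on the smallest possible example, here viewed along the $Y$ axis with point masses at $2$ and $2.0001$: KL divergence is infinite as soon as the supports are disjoint, while the Wasserstein distance only records how far the mass moved.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

# Empirical distributions over the support points {2, 2.0001}:
p = np.array([1.0, 0.0])  # all mass at 2
q = np.array([0.0, 1.0])  # all mass at 2.0001

# KL divergence blows up because q is zero where p has mass.
print(entropy(p, q))  # inf

# The 1-D Wasserstein distance only sees the tiny displacement.
print(wasserstein_distance([2.0], [2.0001]))  # ~ 0.0001
```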