Defining a learning task with measure theory


I'm trying to use the formalism of measure theory to define a learning task. I want to say that a classifier $f: X \rightarrow Y$ tries to approximate the joint probability distribution of $X$ and $Y$.
I don't really know how to define this joint probability distribution, but have the following idea:

Let $(\mathcal{X}, \mathcal{A}, \mu_x)$ and $(\mathcal{Y}, \mathcal{B}, \mu_y)$ be two probability spaces. On $\mathcal{X}$ we define a random variable $X:\mathcal{X} \rightarrow \mathbb{R}$ and on $\mathcal{Y}$ a random variable $Y:\mathcal{Y} \rightarrow \mathbb{R}$. The joint distribution could then be defined as a measure $\mu$ on the product space $\Omega = \mathcal{X} \times \mathcal{Y}$, and my classifier would be a function $f: \mathcal{X} \rightarrow \mathcal{Y}$.

Does that sound correct, or did I completely misunderstand the whole thing?

There are 2 best solutions below


More correct would be to say that $Z=(X,Y)$ is a random variable from the product space $(\Omega,\mu_x\otimes \mu_y)$ to $\mathbb R^2$. And it is not that $f$ is trying to approximate the joint probability distribution, but rather the graph of $f$, $$ \textrm{graph}(f):=\{(x,f(x))\colon x\in \mathcal{X}\}, $$ which tries to approximate the support of $Z$.
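A minimal numerical sketch of this picture, with an entirely hypothetical choice of $f$ (here $\sin$), noise level, and sample size: $Z=(X,Y)$ is sampled so that its law is concentrated near $\textrm{graph}(f)$, and we measure how far samples of $Z$ fall from the graph.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target function; f, the noise level, and the sample
# size are illustrative choices, not taken from the answer.
f = np.sin
n = 10_000

# Z = (X, Y): X uniform on [0, 2*pi], Y = f(X) + Gaussian noise,
# so the law of Z is peaked along graph(f), but (for Gaussian noise)
# its support is the whole strip [0, 2*pi] x R.
x = rng.uniform(0.0, 2 * np.pi, size=n)
y = f(x) + rng.normal(scale=0.1, size=n)

# Vertical distance from each sample of Z to graph(f).
dist_to_graph = np.abs(y - f(x))
print(dist_to_graph.mean())  # small: most of the mass sits near the graph
```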


Starting from @pre-kidney's answer:

In practice the support of $Z$ is likely to be everything: the distribution is merely peaked along the graph of $f$, but still has nonzero probability away from the graph.

So rather than trying to approximate the support of $Z$, it seems better to pick a distribution supported on $\textrm{graph}(f)$ and minimize something like the Wasserstein distance to $Z$.

Say you compute the marginal distribution on $X$ that comes from $Z$. Let $W_f$ be the distribution on $\textrm{graph}(f)$ with the same marginal. Then compare $W_f$ to $Z$ and optimize over choices of $f$.
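A sketch of this construction on empirical samples, with a hypothetical $f$ and data. Because $W_f$ shares its $X$-marginal with $Z$, the coupling that pairs each sample $(x_i, y_i)$ of $Z$ with the point $(x_i, f(x_i))$ on the graph has cost $|y_i - f(x_i)|$ under the $L^1$ metric on $X \times Y$, so its average gives an easy upper bound on the 1-Wasserstein distance between the two empirical measures.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical f and samples from Z; everything here is illustrative.
def f(x):
    return 2.0 * x + 1.0

x = rng.uniform(-1.0, 1.0, size=5_000)
y = f(x) + rng.normal(scale=0.2, size=x.shape)  # samples of Z = (X, Y)

# W_f: push the empirical X-marginal of Z onto graph(f), i.e. the
# empirical measure of the points (x_i, f(x_i)).
wf = np.column_stack([x, f(x)])
z = np.column_stack([x, y])

# The coupling that matches points with the same x_i costs
# |y_i - f(x_i)|, so its average cost upper-bounds W_1(W_f, Z)
# for the empirical measures.
w1_upper = np.abs(z[:, 1] - wf[:, 1]).mean()
print(w1_upper)
```

Minimizing this bound over $f$ recovers the familiar regression objective of minimizing the expected absolute error $\mathbb{E}|Y - f(X)|$.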

Wasserstein distance requires a metric on the underlying space $X \times Y$, but often that is something you have. If you don't, you could use KL divergence instead, but I suspect it will be infinite for support reasons and therefore maximally unhelpful: for example, observing both $(1,2)$ and $(1,2.0001)$ would mean that $Z$ cannot come from the graph of any function, but only for silly error reasons.
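The support issue can be seen on the smallest possible example, here viewed along the $Y$ axis with point masses at $2$ and $2.0001$: KL divergence is infinite as soon as the supports are disjoint, while the Wasserstein distance only records how far the mass moved.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance

# Empirical distributions over the support points {2, 2.0001}:
p = np.array([1.0, 0.0])  # all mass at 2
q = np.array([0.0, 1.0])  # all mass at 2.0001

# KL divergence blows up because q is zero where p has mass.
print(entropy(p, q))  # inf

# The 1-D Wasserstein distance only sees the tiny displacement.
print(wasserstein_distance([2.0], [2.0001]))  # ~ 0.0001
```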