Rigorously, what is the goal of (machine/statistical) Learning and why is that the goal?


After some time doing machine learning and statistical learning theory, I decided to return to my foundations and make sure that the goal of what I am doing makes sense.

First, let me define the generalization error $I(f) = \int_{x,y} V(f(x),y) \, d\rho(x,y)$, where $V(f(x),y)$ is the cost (loss) function. Also, let $f_S$ be the output of a learning algorithm $A$ given the training data $S$, i.e. $f_S = A(S)$.

One way to characterize the goal of learning is as follows:

$$\inf_{f \in F} I(f)$$

given only a finite training set $S$ sampled from the data-generating distribution $\rho(x,y)$, where $F$ is the space of functions on which the integral is well defined (everything measurable and well behaved, not necessarily a Hilbert space). Technically, if the infimum could be achieved, that would be the best of the best: possibly not feasible to compute, but mathematically it makes sense. Another way of saying this is that the goal of learning is to minimize the expected error given only the training data.
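Since $\rho$ is unknown, in practice one works with the empirical risk on $S$ as a proxy for $I(f)$. Here is a minimal numerical sketch (my own toy setup, with an assumed distribution $y = \sin(x) + \text{noise}$ and squared-error cost, not anything from the question) comparing the empirical risk on a small sample with a Monte Carlo estimate of the generalization error:

```python
import numpy as np

rng = np.random.default_rng(0)

def V(pred, y):
    # squared-error cost V(f(x), y)
    return (pred - y) ** 2

def sample(n):
    # assumed data distribution rho(x, y): x uniform, y = sin(x) + noise
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.1, n)
    return x, y

f = np.cos  # an arbitrary fixed candidate function f

# empirical risk on a finite training set S (all we actually observe)
xs, ys = sample(50)
emp_risk = V(f(xs), ys).mean()

# Monte Carlo estimate of the generalization error I(f)
xl, yl = sample(1_000_000)
gen_error = V(f(xl), yl).mean()

print(emp_risk, gen_error)
```

The point is that the empirical risk on $S$ fluctuates around $I(f)$, which is why statements about the goal of learning are phrased in terms of the population quantity $I(f)$ rather than the training error.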

Since we probably can't achieve the above infimum, we need a way to quantify the quality of our solution $f_S$.

One possibility could be:

$$\inf_{f_S \in \mathcal{H}} \left\{ I(f_S) - \inf_{f \in F} I(f) \right\}$$

i.e. trying to have an $f_S$ as close to the best as possible.
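On a toy problem where the minimizer is known in closed form, this excess risk can be estimated directly. A sketch under assumed squared loss, for which the regression function $\mathbb{E}[y \mid x]$ minimizes $I$ over all measurable $f$ (the data distribution and the linear model below are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    # assumed distribution: x uniform on [-3, 3], y = sin(x) + Gaussian noise
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.1, n)
    return x, y

# under squared loss, f*(x) = E[y | x] = sin(x) minimizes I over all measurable f
f_star = np.sin

# a hypothetical learned function f_S: a straight-line fit to 30 training points
xs, ys = sample(30)
slope, intercept = np.polyfit(xs, ys, 1)
f_S = lambda t: slope * t + intercept

# Monte Carlo estimates of I(f_S) and I(f*) = inf_f I(f)
xl, yl = sample(500_000)
I_fS = ((f_S(xl) - yl) ** 2).mean()
I_star = ((f_star(xl) - yl) ** 2).mean()

excess = I_fS - I_star  # the excess risk; nonnegative up to Monte Carlo error
print(excess)
```

Here the excess risk is strictly positive because the hypothesis class (lines) cannot represent $f^* = \sin$, which illustrates the gap the expression above measures.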

However, I was considering a different way of characterizing what exactly it means to learn, and I was wondering whether it makes sense, whether the two are equivalent, or whether one is stronger than the other.

Consider minimizing the infinity norm between the algorithm's output function $f_S = A(S)$ (where $A$ is the learning algorithm and $S$ is the training set) and the best function $f^* = \operatorname{arginf}\limits_{f \in F} I(f)$ (assuming the infimum is attained), i.e. consider:

$$ \inf_{f_S \in \mathcal{H}} \left\{ \sup_{x \in \mathcal{X}} |f_S(x) - f^*(x)| \right\}$$

Would trying to minimize $\sup_{x \in \mathcal{X}} |f_S(x) - f^*(x)|$ make sense as a goal for learning?

I was told that this goal was "too strong". What does "too strong" mean here? Is it meant in a mathematical sense? For example, does minimizing the above imply minimizing $I(f_S) - \inf_{f \in F} I(f)$?

Intuitively, it seems to me that minimizing the above does imply minimizing $I(f_S) - \inf_{f \in F} I(f)$. Why? Because $\inf_{f_S \in \mathcal{H}} \left\{ \sup_{x \in \mathcal{X}} |f_S(x) - f^*(x)| \right\}$ means minimizing the largest pointwise difference between $f_S$ (the output of the learning algorithm) and the optimum $f^*$. If the largest difference over all $x$ is made small, then $f_S$ and $f^*$ are essentially the same (or approximately the same). Hence, they should have approximately the same average cost over the whole data space, i.e.

$$I(f_S) - \inf_{f \in F} I(f) \approx 0$$
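This implication can be made precise under an extra assumption that I'll state explicitly: if the cost $V(\cdot, y)$ is $L$-Lipschitz in its first argument, uniformly in $y$, then

$$ I(f_S) - I(f^*) = \int_{x,y} \left[ V(f_S(x),y) - V(f^*(x),y) \right] d\rho(x,y) \le L \int_{x,y} |f_S(x) - f^*(x)| \, d\rho(x,y) \le L \sup_{x \in \mathcal{X}} |f_S(x) - f^*(x)|, $$

so a small sup-norm distance to $f^*$ does force a small excess risk. The converse fails in general, which is one sense in which the sup-norm goal is strictly stronger.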

Maybe this is way too strong, and we only need the functions to be similar on data samples $(x,y)$ that are actually possible to sample, so maybe:

$$ \inf_{f_S \in \mathcal{H}} \left\{ \sup_{x \in \operatorname{support}(\rho)} |f_S(x) - f^*(x)| \right\}$$

Would this make more sense? Or are both definitions still unreasonable quantities to try to optimize in learning?
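One concrete way to see why the unrestricted sup norm over all of $\mathcal{X}$ is demanding: two functions that agree on the support of $\rho$ but disagree arbitrarily off it have identical generalization error. A toy numerical check (the distribution here is my own assumption, as before):

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed rho: x uniform on [-1, 1], y = sin(x) + noise
x = rng.uniform(-1, 1, 500_000)
y = np.sin(x) + rng.normal(0, 0.1, len(x))

f1 = np.sin
# f2 agrees with f1 on the support [-1, 1] but is wildly different outside it
f2 = lambda t: np.where(np.abs(t) <= 1, np.sin(t), 1e6)

I1 = ((f1(x) - y) ** 2).mean()
I2 = ((f2(x) - y) ** 2).mean()

# identical risks, yet sup over all of X of |f1 - f2| is huge
print(I1, I2, abs(f1(5.0) - f2(5.0)))
```

So the sup over $\mathcal{X}$ penalizes behavior that $I(f)$ is completely blind to, while the sup over $\operatorname{support}(\rho)$ does not have this particular problem.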

Why would minimizing:

$$ I(f_S) - \inf_{f \in F} I(f)$$

be a much better goal?

Also, after thinking about it more, maybe we would like to take into account the relative weights, or relative frequencies, with which points are sampled, and instead minimize:

$$ \int_{x} |f_S(x) - f^*(x) | d\rho(x)$$

Maybe choosing an $f_S$ that achieves the above minimum is a good choice. However, I have never seen this quantity discussed anywhere. Is it a bad quantity to consider?
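The quantity above is straightforward to estimate by Monte Carlo given samples from the marginal of $x$. Incidentally, under squared loss the excess risk equals the squared $L^2(\rho)$ distance $\int_x |f_S(x) - f^*(x)|^2 \, d\rho(x)$, which makes the comparison with the $L^1(\rho)$ quantity concrete. A sketch (the toy setup and the particular $f_S$ are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# assumed toy setup: x uniform on [-3, 3], y = sin(x) + noise, squared loss
n = 500_000
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(0, 0.1, n)

f_star = np.sin            # minimizer of I under squared loss
f_S = lambda t: 0.3 * t    # a hypothetical learned function

# the proposed L1(rho) distance, estimated by Monte Carlo
l1_dist = np.abs(f_S(x) - f_star(x)).mean()

# excess risk under squared loss equals the squared L2(rho) distance
excess = ((f_S(x) - y) ** 2).mean() - ((f_star(x) - y) ** 2).mean()
l2_sq = ((f_S(x) - f_star(x)) ** 2).mean()

print(l1_dist, excess, l2_sq)  # excess and l2_sq agree up to sampling error
```

So at least for squared loss, the excess risk is itself a distribution-weighted distance to $f^*$, just in $L^2(\rho)$ rather than $L^1(\rho)$.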

What is the difference between minimizing $I(f_S) - \inf_{f \in F} I(f)$ and minimizing $\int_{x} |f_S(x) - f^*(x)| \, d\rho(x)$? What are the pros and cons of each? And why is $I(f_S) - \inf_{f \in F} I(f)$ probably the better goal?