Suppose I have a random variable $X$ with values in $\mathbb{R}^n$, and a function $\mathscr{L}:\mathbb{R}^n \to \mathbb{R}$. In practice $X$ could represent a distribution of data, and $\mathscr{L}$ could be the loss function associated with a training algorithm, for instance. I'm interested in analyzing the difference between the distribution of $\mathscr{L}(X)$ and the distribution that results from taking $n$ independent draws from $X$ and averaging the value of $\mathscr{L}$ over them: $$\mathscr{L}(X) \quad \text{vs} \quad \frac{1}{n} \sum_{i=1}^n \mathscr{L}(X_i) , \quad X_i \text{ i.i.d. copies of } X.$$ In particular I'd be interested in comparing the expectations of these two quantities.
(For example, if $\mathscr{L}$ is the identity function, and we look at expectations, we're looking at the empirical mean vs. the true mean.)
Obviously nothing can be said at this level of generality. But I'm interested in reading about techniques that are relevant to such a problem.
(Cross posted on stats stack here)
Assuming I've understood your question, their expectations are the same: $$ \mathbb{E}\left[\frac{1}{n}\sum_i\mathcal{L}(X_i)\right] = \frac{1}{n}\sum_i \mathbb{E}[\mathcal{L}(X)] = \mathbb{E}[\mathcal{L}(X)] $$ by the linearity of expectation, since $X_i \stackrel{d}{=} X$ (each $X_i$ has the same distribution as $X$).
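A quick Monte Carlo check of this equality, using an assumed toy setup ($X \sim \mathcal{N}(0,1)$ and $\mathcal{L}(x) = x^2$, so $\mathbb{E}[\mathcal{L}(X)] = 1$ exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration): X ~ N(0, 1), L(x) = x^2,
# so E[L(X)] = Var(X) = 1 exactly.
def L(x):
    return x ** 2

n = 5            # number of i.i.d. copies averaged in the second quantity
trials = 200_000

# Estimate E[L(X)] from single draws
single = L(rng.standard_normal(trials)).mean()

# Estimate E[(1/n) * sum_i L(X_i)] from averaged draws
averaged = L(rng.standard_normal((trials, n))).mean(axis=1).mean()

print(single, averaged)  # both approach 1
```

Note that while the expectations coincide, the *variances* do not: averaging over $n$ draws shrinks the variance by a factor of $n$, which is exactly why the two distributions differ.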
Essentially, in the IID case, this is not an interesting problem. The real issues start when some of these assumptions no longer hold.
What you probably want to look at is computational learning theory, specifically the probably approximately correct framework (see also here). This lets you prove things like sample complexity bounds, which (for a specific hypothesis space) tell you how many data points you need to guarantee that your learner can get error less than $\epsilon$ with probability greater than $1-\delta$. Clearly, if the IID assumption is violated, this will alter the sample complexity in the same way as correlation alters effective sample size. Some references: [1], [2], [3].
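As a concrete instance of such a bound, here is the classic sample complexity for a *finite* hypothesis class in the realizable PAC setting, $m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$ (the function name and chosen parameter values below are just illustrative):

```python
import math

def pac_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Classic bound for a finite hypothesis class in the realizable PAC
    setting: m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice to
    achieve error < eps with probability > 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 1000 hypotheses, eps = 0.05, delta = 0.01
print(pac_sample_complexity(h_size=1000, eps=0.05, delta=0.01))  # 231
```

Note how the bound is only logarithmic in $|\mathcal{H}|$ and $1/\delta$, but linear in $1/\epsilon$; this is the i.i.d. baseline that correlated sampling degrades.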
Another interesting direction is domain adaptation. In this case, your training and test sets ($X$ and $T$ respectively) are drawn from different distributions over the data space (e.g. training on poodles vs house-cats, but testing on chihuahuas vs tigers), i.e. $X\sim D_1, T\sim D_2$. Then your expected loss will depend on the discrepancy between the distributions $D_1$ and $D_2$! A reference for this is Sun et al., *A survey of multi-source domain adaptation*.
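A minimal simulation of this effect, under an assumed toy setup (the constant predictor that is optimal under $D_1$, evaluated with squared loss under both $D_1$ and a mean-shifted $D_2$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: D1 = N(0, 1) for training, D2 = N(2, 1) for testing.
train = rng.normal(loc=0.0, scale=1.0, size=100_000)  # X ~ D1
test = rng.normal(loc=2.0, scale=1.0, size=100_000)   # T ~ D2

# The D1-optimal constant predictor under squared loss is the mean of D1.
prediction = train.mean()  # ~ 0

loss_on_D1 = np.mean((train - prediction) ** 2)  # ~ Var(D1) = 1
loss_on_D2 = np.mean((test - prediction) ** 2)   # ~ Var(D2) + shift^2 = 5

print(loss_on_D1, loss_on_D2)
```

The gap between the two losses grows with the squared mean shift between $D_1$ and $D_2$, which is the simplest example of the distribution-discrepancy terms that appear in domain adaptation bounds.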