Suppose I have a random variable $X$ with values in $\mathbb{R}^n$, and a function $\mathscr{L}:\mathbb{R}^n \to \mathbb{R}$. In practice $X$ could represent a distribution of data, and $\mathscr{L}$ could be the loss function associated with a training algorithm, for instance. I'm interested in analyzing the difference between the distribution of $\mathscr{L}(X)$ and the distribution that results from taking $n$ independent draws from $X$ and averaging the value of $\mathscr{L}$ over them: $$\mathscr{L}(X) \quad \text{vs} \quad \frac{1}{n} \sum_{i=1}^n \mathscr{L}(X_i) , \quad X_i \text{ i.i.d. copies of } X.$$ In particular I'd be interested in comparing the expectations of these two quantities.
(For example, if $\mathscr{L}$ is the identity function, and we look at expectations, we're looking at the empirical mean vs. the true mean.)
Obviously nothing can be said at this level of generality. But I'm interested in reading about techniques that are relevant to such a problem.
(Cross posted on stats stack here)
Assuming I've understood your question, their expectations are the same: $$ \mathbb{E}\left[\frac{1}{n}\sum_i\mathcal{L}(X_i)\right] = \frac{1}{n}\sum_i \mathbb{E}[\mathcal{L}(X)] = \mathbb{E}[\mathcal{L}(X)] $$ by the linearity of expectation, since $X_i \stackrel{d}{=} X$ (each $X_i$ has the same distribution as $X$).
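A quick Monte Carlo check of this equality, using an assumed toy setup ($X \sim \mathcal{N}(0,1)$ and $\mathcal{L}(x) = x^2$, so $\mathbb{E}[\mathcal{L}(X)] = 1$ exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed for illustration): X ~ N(0, 1), L(x) = x^2,
# so E[L(X)] = Var(X) = 1 exactly.
def L(x):
    return x ** 2

n = 5            # number of i.i.d. copies averaged in the second quantity
trials = 200_000

# Estimate E[L(X)] from single draws
single = L(rng.standard_normal(trials)).mean()

# Estimate E[(1/n) * sum_i L(X_i)] from averaged draws
averaged = L(rng.standard_normal((trials, n))).mean(axis=1).mean()

print(single, averaged)  # both approach 1
```

Note that while the expectations coincide, the *variances* do not: averaging over $n$ draws shrinks the variance by a factor of $n$, which is exactly why the two distributions differ.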
Essentially, in the IID case, this is not an interesting problem. The real issues start when some of these assumptions no longer hold.
What you probably want to look at is computational learning theory, specifically the probably approximately correct framework (see also here). This lets you prove things like sample complexity bounds, which (for a specific hypothesis space) tell you how many data points you need to guarantee that your learner can get error less than $\epsilon$ with probability greater than $1-\delta$. Clearly, if the IID assumption is violated, this will alter the sample complexity in the same way as correlation alters effective sample size. Some references: [1], [2], [3].
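As a concrete instance of such a bound, here is the classic sample complexity for a *finite* hypothesis class in the realizable PAC setting, $m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$ (the function name and chosen parameter values below are just illustrative):

```python
import math

def pac_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Classic bound for a finite hypothesis class in the realizable PAC
    setting: m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice to
    achieve error < eps with probability > 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 1000 hypotheses, eps = 0.05, delta = 0.01
print(pac_sample_complexity(h_size=1000, eps=0.05, delta=0.01))  # 231
```

Note how the bound is only logarithmic in $|\mathcal{H}|$ and $1/\delta$, but linear in $1/\epsilon$; this is the i.i.d. baseline that correlated sampling degrades.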
Another interesting direction is domain adaptation. In this case, your training and test sets ($X$ and $T$ respectively) are drawn from different distributions over the data space (e.g. training on poodles vs house-cats, but testing on chihuahuas vs tigers), i.e. $X\sim D_1, T\sim D_2$. Then your expected loss will depend on the discrepancy between the distributions $D_1$ and $D_2$! A reference for this is Sun et al., *A survey of multi-source domain adaptation*.
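A minimal simulation of this effect, under an assumed toy setup (the constant predictor that is optimal under $D_1$, evaluated with squared loss under both $D_1$ and a mean-shifted $D_2$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: D1 = N(0, 1) for training, D2 = N(2, 1) for testing.
train = rng.normal(loc=0.0, scale=1.0, size=100_000)  # X ~ D1
test = rng.normal(loc=2.0, scale=1.0, size=100_000)   # T ~ D2

# The D1-optimal constant predictor under squared loss is the mean of D1.
prediction = train.mean()  # ~ 0

loss_on_D1 = np.mean((train - prediction) ** 2)  # ~ Var(D1) = 1
loss_on_D2 = np.mean((test - prediction) ** 2)   # ~ Var(D2) + shift^2 = 5

print(loss_on_D1, loss_on_D2)
```

The gap between the two losses grows with the squared mean shift between $D_1$ and $D_2$, which is the simplest example of the distribution-discrepancy terms that appear in domain adaptation bounds.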