Derivation of the bias-variance tradeoff


I'm having trouble understanding the derivation of the bias-variance tradeoff, which is also given in the Wikipedia article on the mean squared error.

Let some data be represented by the random variable $X$ with distribution $P$. $f$ maps a sample $x$ to its true label $f(x)$, and $\hat{f}$ is a model for $f$, mapping a sample $x$ to its predicted label $\hat{f}(x)$. Since $X$ is a random variable, so are $f$ and $\hat{f}$, because they are functions of $X$.

Now, the statement of the bias-variance tradeoff is that $$MSE(\hat{f},f)=Bias(\hat{f},f)^2+Var(\hat{f})$$

Since $Var_P(\hat{f})=E_P[(\hat{f}-E_P[\hat{f}])^2]$ I get $$MSE(\hat{f},f)=E_P[(\hat{f}-f)^2]=E_P\left[((\hat{f}-E_P[\hat{f}])+(E_P[\hat{f}]-f))^2\right]$$ $$=Var_P(\hat{f})+2E_P\left[(\hat{f}-E_P[\hat{f}])(E_P[\hat{f}]-f)\right]+E_P\left[(E_P[\hat{f}]-f)^2\right]$$

Now, on Wikipedia, this becomes

$$=Var_P(\hat{f})+(E_P[\hat{f}]-f)^2$$

because 'the expression $E_P[\hat{f}]-f$ is constant/deterministic' which means it can be moved out of the outer expectation and the middle term becomes $0$, and the last term becomes the squared bias.

I do not understand this. $f$ is a function of $X$, so if the outer expectation is taken over $P$, it is not a constant/deterministic. It's a random variable.

Concrete example: Let's define $P$ by $P(X=a)=\frac{1}{3}$ and $P(X=b)=\frac{2}{3}$. The true labels are given by $f(a)=1, f(b)=2$, and the predicted labels by $\hat{f}(a)=2$ and $\hat{f}(b)=3$.

Then $E_P[\hat{f}]=\frac{1}{3}\cdot 2+\frac{2}{3}\cdot 3=\frac{8}{3}$ and the middle term above becomes

$$2E_P\left[(\hat{f}-E_P[\hat{f}])(E_P[\hat{f}]-f)\right]$$$$=2\left(\frac{1}{3}(2-\frac{8}{3})(\frac{8}{3}-1)+\frac{2}{3}(3-\frac{8}{3})(\frac{8}{3}-2)\right)=-\frac{4}{9}$$

which in particular is $\neq 0$.
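The arithmetic in the concrete example can be checked exactly with rational numbers (this snippet is just a verification of the example above, not part of the derivation):

```python
from fractions import Fraction

# Distribution P over {a, b} and the labels from the example above
p    = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
f    = {"a": 1, "b": 2}   # true labels f(a), f(b)
fhat = {"a": 2, "b": 3}   # predicted labels

E_fhat = sum(p[x] * fhat[x] for x in p)   # E_P[fhat] = 8/3

# Middle (cross) term: 2 * E_P[(fhat - E[fhat]) * (E[fhat] - f)]
cross = 2 * sum(p[x] * (fhat[x] - E_fhat) * (E_fhat - f[x]) for x in p)
print(cross)  # -4/9, confirming the term does not vanish in this setting
```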

What am I missing here?


There are two sources of randomness that you are conflating.

$f$ is some function that you are trying to learn. The function itself is deterministic. [However, if evaluated at a random quantity $X$, the evaluation $f(X)$ is random, which is what you seem to be considering. But the function $f$ itself is deterministic.]

$\hat{f}$ is a function learned from data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. We typically consider $D$ as a random quantity. Since $\hat{f}$ depends on $D$, the function $\hat{f}$ is random.


There are several variants of "MSE." The simplest one is the error when evaluated at a fixed test point $x$, which is what the Wikipedia page follows. $$\text{MSE}_x(\hat{f}, f) = E_D[(\hat{f}_D(x) - f(x))^2].$$ [I assume $\sigma^2 = 0$, i.e. the irreducible error is zero, since you are not considering it.] Note that in this expression, the only source of randomness is the data $D$. We evaluate the functions at a fixed/deterministic test point $x$. In particular, the quantity $\hat{f}_D(x)$ is random because the function $\hat{f}_D$ is random because it depends on $D$. Meanwhile, $f(x)$ is not random.
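With this setup, the cross term from your expansion does vanish: $E_D[\hat{f}_D(x)] - f(x)$ is a deterministic number (no dependence on $D$ remains), so it can be pulled out of the expectation,

$$2E_D\left[(\hat{f}_D(x)-E_D[\hat{f}_D(x)])(E_D[\hat{f}_D(x)]-f(x))\right] = 2(E_D[\hat{f}_D(x)]-f(x))\,E_D\left[\hat{f}_D(x)-E_D[\hat{f}_D(x)]\right] = 0,$$

since $E_D\left[\hat{f}_D(x)-E_D[\hat{f}_D(x)]\right]=0$.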

Now you can go through the derivation to obtain the decomposition $$E_D[(f(x) - \hat{f}_D(x))^2] =E_D[(\hat{f}_D(x) - E_D[\hat{f}_D(x)])^2] + (E_D[\hat{f}_D(x)] - f(x))^2.$$
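You can also see this decomposition hold numerically by simulating many training datasets $D$ and evaluating at a fixed test point. The target function ($\sin$) and the learner (a least-squares line) below are arbitrary stand-ins, not anything from the original question:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Deterministic target function (an arbitrary stand-in)."""
    return np.sin(x)

def learn(D_x, D_y):
    """Toy learner: least-squares line through the training data."""
    coef = np.polyfit(D_x, D_y, deg=1)
    return lambda x: np.polyval(coef, x)

x0 = 1.5                                   # fixed, deterministic test point
preds = []
for _ in range(5000):                      # 5000 independent training sets D
    D_x = rng.uniform(0.0, np.pi, size=10)
    preds.append(learn(D_x, f(D_x))(x0))   # noiseless labels: sigma^2 = 0
preds = np.array(preds)

mse   = np.mean((preds - f(x0)) ** 2)      # E_D[(fhat_D(x0) - f(x0))^2]
var   = np.var(preds)                      # E_D[(fhat_D(x0) - E[fhat_D(x0)])^2]
bias2 = (np.mean(preds) - f(x0)) ** 2      # (E[fhat_D(x0)] - f(x0))^2
print(mse, var + bias2)                    # equal up to floating-point error
```

Note that the only randomness here is in $D$; $x_0$ and $f(x_0)$ are held fixed throughout, which is exactly why the decomposition is exact.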


Sometimes MSE is instead defined with respect to a random test point $X$ rather than a fixed one. This amounts to plugging in the random test point $X$ in place of $x$, and wrapping everything in a second expectation with respect to the randomness in $X$. So in this setting, there are two sources of randomness: the original data $D$ that was used to learn $\hat{f}_D$, and a separate test point $X$ used to evaluate the performance of $\hat{f}_D$.

$$E_X\big[E_D[(f(X) - \hat{f}_D(X))^2]\big] =E_X\big[E_D[(\hat{f}_D(X) - E_D[\hat{f}_D(X)])^2]\big] + E_X\big[(E_D[\hat{f}_D(X)] - f(X))^2\big].$$

But the crux of the derivation of the bias-variance tradeoff does not require considering this extra layer of randomness, so it is better to first consider the fixed-$x$ scenario, and then add on this extra random-$X$ layer afterward.
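The random-$X$ version is just the fixed-$x$ check averaged over a range of test points. As a sketch (same arbitrary target and toy learner as before; a grid of points stands in for $X$ drawn uniformly):

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                                 # deterministic target (arbitrary)

def learn(D_x, D_y):
    coef = np.polyfit(D_x, D_y, deg=1)     # toy learner: least-squares line
    return lambda x: np.polyval(coef, x)

xs = np.linspace(0.0, np.pi, 50)           # grid standing in for X ~ Uniform
preds = np.array([learn(D_x, f(D_x))(xs)
                  for D_x in (rng.uniform(0.0, np.pi, 10)
                              for _ in range(2000))])
# preds has shape (n_datasets, n_test_points): axis 0 is D, axis 1 is X

mse   = np.mean((preds - f(xs)) ** 2)                # E_X[E_D[...]]
var   = np.mean(np.var(preds, axis=0))               # E_X[Var_D]
bias2 = np.mean((preds.mean(axis=0) - f(xs)) ** 2)   # E_X[Bias^2]
print(mse, var + bias2)                              # agree to float precision
```

The decomposition holds exactly at each test point separately (the fixed-$x$ case), so averaging over the points preserves it.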


Response to comment:

If your $P$ is as you describe ("where the samples come from on which $\hat{f}$ is trained") then yes, my $D$ is your $P$. I haven't assumed that each of the $(x_i, y_i)$ pairs comes from the same distribution; I'm just assuming the entire dataset $D=\{(x_1, y_1) ,\ldots, (x_n, y_n)\}$ comes from some data-generating distribution.

The notion of MSE is with respect to the randomness in the training data, so $E_D$ (or $E_P$) cannot be discarded even when considering a fixed test point $x$. For a given training dataset of images $x_1, \ldots, x_n$, yes, you have true labels $y_1, \ldots, y_n$. The randomness is not in the labels, but rather in the entire dataset: imagine today you have one training dataset, tomorrow you have a different training dataset of images, the day after you have yet another, and so on. For each dataset, $\hat{f}$ is different, and the MSE, bias, and variance refer to how $\hat{f}$ changes with respect to changes in the dataset.