Bias-variance decomposition


I found the following in The Elements of Statistical Learning.

Suppose we have 1000 training examples $x_i$ generated uniformly on $[-1,1]^p$. Assume that the true relationship between $X$ and $Y$ is $$Y = f(X) = e^{-8||X||^2}$$ without any measurement error. We use the 1-nearest-neighbour rule to predict $y_0$ at the test point $x_0 = 0$. Denote the training set by $\mathcal T$. We can compute the expected prediction error at $x_0$ for our procedure, averaging over all such samples of size 1000. Since the problem is deterministic, this is the mean squared error (MSE) for estimating $f(0)$:

\begin{align}\text{MSE}(x_0) &= E_\mathcal T[f(x_0)-\hat y_0]^2\\ &= E_\mathcal T[\hat y_0 - E_\mathcal T(\hat y_0)]^2 + [E_\mathcal T(\hat y_0)-f(x_0)]^2\\ &= \mbox{Var}_\mathcal T(\hat y_0) + \mbox{Bias}^2(\hat y_0) \end{align}
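As a quick sanity check (not from the book), the setup above can be simulated directly: draw many training sets of size 1000, compute the 1-NN prediction at $x_0 = 0$ for each, and compare the Monte Carlo estimates of $\text{MSE}$, $\text{Var}_\mathcal T(\hat y_0)$, and $\text{Bias}^2(\hat y_0)$. The number of replications (`n_sets = 200`) and the choice $p = 2$ are my own; the decomposition itself holds as an algebraic identity for any sample.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, n_sets = 2, 1000, 200  # dimension, training-set size, number of replications

def f(x):
    # true regression function f(x) = exp(-8 ||x||^2)
    return np.exp(-8 * np.sum(x**2, axis=-1))

preds = np.empty(n_sets)
for s in range(n_sets):
    X = rng.uniform(-1, 1, size=(n, p))
    # 1-NN prediction at x0 = 0: the response of the training point nearest the origin
    i = np.argmin(np.sum(X**2, axis=1))
    preds[s] = f(X[i])

f0 = f(np.zeros(p))                     # f(0) = 1
mse = np.mean((f0 - preds)**2)          # E_T [f(x0) - yhat0]^2
var = np.mean((preds - preds.mean())**2)
bias2 = (preds.mean() - f0)**2
# mse equals var + bias2 up to floating-point rounding
print(mse, var + bias2)
```

Since the prediction is the value of the nearest training point, and that point is pulled away from the origin as $p$ grows, rerunning with larger `p` shows the bias term dominating (the curse of dimensionality that ESL illustrates with this example).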

I don't quite understand what they mean by the problem being deterministic. Also, how exactly do they get from the first line to the second line? I played around with the binomial formula but I can't seem to derive this.

On BEST ANSWER

To answer the second request:

$$ \begin{align} \text{MSE}(x_0) &= E_\mathcal T[f(x_0)-\hat y_0]^2\\ &= E_\mathcal T[\hat y_0 - E_\mathcal T(\hat y_0) + E_\mathcal T(\hat y_0)-f(x_0)]^2\\ &= E_\mathcal T[\hat y_0 - E_\mathcal T(\hat y_0)]^2 + E_\mathcal T[E_\mathcal T(\hat y_0)-f(x_0)]^2 + 2E_\mathcal T([\hat y_0 - E_\mathcal T(\hat y_0)][E_\mathcal T(\hat y_0)-f(x_0)])\\ &= E_\mathcal T[\hat y_0 - E_\mathcal T(\hat y_0)]^2 + [E_\mathcal T(\hat y_0)-f(x_0)]^2\\ &= \mbox{Var}_\mathcal T(\hat y_0) + \mbox{Bias}^2(\hat y_0), \end{align} $$ where the fourth equality holds because the cross term vanishes: $E_\mathcal T[\hat y_0 - E_\mathcal T(\hat y_0)] = E_\mathcal T(\hat y_0) - E_\mathcal T(\hat y_0) = 0$, and $E_\mathcal T[E_\mathcal T(\hat y_0)-f(x_0)]^2 = [E_\mathcal T(\hat y_0)-f(x_0)]^2$ since $f(x_0)$ and $E_\mathcal T(\hat y_0)$ are constants.
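The key step, the vanishing cross term, can also be checked numerically for an arbitrary sample. Here `y` stands in for draws of $\hat y_0$ over training sets and `c` for the constant $f(x_0)$ (both placeholders of my choosing); with the sample mean playing the role of $E_\mathcal T(\hat y_0)$, the identity holds exactly up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100_000)  # stand-in for yhat0 across training sets
c = 0.7                       # stand-in for the constant f(x0)

m = y.mean()                  # plays the role of E_T(yhat0)
mse = np.mean((c - y)**2)
var = np.mean((y - m)**2)
bias2 = (m - c)**2
cross = 2 * np.mean((y - m) * (m - c))
# cross = 2 (m - c) * mean(y - m) = 0, so mse = var + bias2
print(cross, mse - (var + bias2))
```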

To answer the first request: they are interested in the case where the random variable $X$ takes the value $x_0=0$ with probability $1$; a distribution concentrated at a single point is called degenerate, and "degenerate" and "deterministic" are used interchangeably here. See https://en.wikipedia.org/wiki/Degenerate_distribution for more.