Prove that the k-fold cross validation error is an almost unbiased estimate of the risk


In the setting of linear regression:

The prediction error using k-fold cross-validation is given by: \begin{equation}\label{eq1} \begin{split} err_{k\text{-}fold} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{n/k}\sum_{j\in ind[i]}l(h_{D_{(-i)}}(x_j),y_j) \end{split} \end{equation}
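For concreteness, the estimator can be sketched in code. This is a minimal sketch assuming squared-error loss $l(\hat{y}, y) = (\hat{y} - y)^2$ and an ordinary least squares learner (both are assumptions for illustration; the formula above holds for any loss $l$ and learner $h$):

```python
import numpy as np

def kfold_error(X, y, k, rng):
    """Compute err_{k-fold}: the average held-out squared error
    over k folds, each model trained on the other k-1 folds."""
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), k)  # the index sets ind[i]
    fold_errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # h_{D_{(-i)}}: OLS fit on all folds except the i-th
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # inner sum: average loss over the i-th fold
        fold_errs.append(np.mean((y[test] - X[test] @ w) ** 2))
    # outer average over the k folds
    return float(np.mean(fold_errs))
```

When $k$ divides $n$, averaging the per-fold means and then averaging over folds reproduces the double sum in the equation exactly.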

Where $h_{D_{(-i)}}$ denotes the model trained on all the folds except the $i$-th one, and $ind[i]$ is the set of indices of the points in the $i$-th fold.

I want to show that :

\begin{equation}\label{eq2} \begin{split} \mathop{\mathbb{E}}_{D\sim p}\left[ err_{k\text{-}fold}\right] = \mathop{\mathbb{E}}_{\substack{D'\sim p \\ (x,y)\sim p}}\left[ (y-h_{D'}(x))^2 \right] \end{split} \end{equation}

Where $D$ is a dataset of size $n$ and $D'$ is a dataset of size $n-\frac{n}{k}$ (assuming $k$ divides $n$).

And then prove that $err_{k\text{-}fold}$ is an almost unbiased estimate of the risk of $h_D$.


My approach was to compute the expectation of $err_{k\text{-}fold}$ and use the fact that $D$ and $D'$ are sampled from the same distribution, but I am lost. Can you please help me?

Thanks


On BEST ANSWER

By linearity of expectation, you can push the expectation into the sum: $$E[\text{err}_{k\text{-}fold}] = \frac{1}{k} \sum_{i=1}^k \frac{1}{n/k} \sum_{j \in ind[i]} E[l(h_{D_{(-i)}}(x_j), y_j)].$$ Now observe that for every $i$ and $j$, the training set $D_{(-i)}$ consists of $n - \frac{n}{k}$ i.i.d. samples from $p$, and the held-out point $(x_j, y_j)$ is a further independent sample from $p$. Hence the inner expectation is precisely equal to $$\underset{\substack{D' \sim p \\ (x,y) \sim p}}{E}[l(h_{D'}(x), y)]$$ regardless of the values of $i$ and $j$. So the double sum is just an average of $n$ copies of the same quantity, which proves the identity. The estimate is only *almost* unbiased because the right-hand side is the expected risk of a model trained on $n - \frac{n}{k}$ points, whereas $h_D$ is trained on all $n$ points; as $k$ grows the training sets differ by fewer points, so the bias shrinks.
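As a sanity check (not part of the proof), here is a small Monte Carlo experiment comparing the two sides of the identity. It assumes squared-error loss and an OLS learner on synthetic data; the dimension, noise level, and trial counts are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, trials = 60, 5, 3, 300

def draw(m):
    """Draw m i.i.d. samples from a fixed linear-plus-noise distribution p."""
    X = rng.normal(size=(m, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=m)
    return X, y

def ols(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Left side: E_{D ~ p}[err_{k-fold}], estimated over many datasets D of size n
lhs = []
for _ in range(trials):
    X, y = draw(n)
    folds = np.array_split(rng.permutation(n), k)
    fold_errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ols(X[train], y[train])
        fold_errs.append(np.mean((y[test] - X[test] @ w) ** 2))
    lhs.append(np.mean(fold_errs))

# Right side: E_{D' ~ p, (x,y) ~ p}[(y - h_{D'}(x))^2],
# with D' of size n - n/k and fresh independent test points
rhs = []
for _ in range(trials):
    Xp, yp = draw(n - n // k)
    w = ols(Xp, yp)
    Xt, yt = draw(20)
    rhs.append(np.mean((yt - Xt @ w) ** 2))

print(np.mean(lhs), np.mean(rhs))  # the two averages should be close
```

Both averages estimate the risk of a model trained on $n - \frac{n}{k} = 48$ points, so they should agree up to Monte Carlo noise.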