Why does k-fold cross-validation generate an MSE estimator that has higher bias, but lower variance than leave-one-out cross-validation?


Looks like the rationale behind the accepted answer of this post is incorrect.

Under leave-one-out cross-validation (LOOCV), the variance of its MSE estimator is $$\operatorname{var}\left[\frac{\sum_i x_i}{n}\right] = \frac{\operatorname{var}\left[\sum_i x_i\right]}{n^2},$$ where $x_i$ is the MSE estimate from one particular iteration.

I agree that LOOCV has a larger numerator (because of the covariance terms), but the denominator is larger as well, since there are essentially $n$ estimates rather than the $k$ estimates of the k-fold case.
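Spelling out the numerator makes the tension explicit (standard variance algebra, in the question's notation):

$$\operatorname{var}\left[\frac{1}{n}\sum_i x_i\right] = \frac{1}{n^2}\left(\sum_i \operatorname{var}[x_i] + \sum_{i \neq j} \operatorname{cov}[x_i, x_j]\right)$$

The $1/n^2$ factor shrinks as the number of folds grows, but so does the number of covariance terms, $n(n-1)$ of them, and under LOOCV each pair of folds shares $n-2$ training points, so those covariances are large.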

Given this, why does LOOCV still have higher variance in estimating MSE and why does it have lower bias?

(This runs against intuition, because increasing the sample size should decrease variance and leave bias unchanged for $\hat\theta$ and $\hat{y}$.)

BEST ANSWER

This is a complex topic, but here are some thoughts.

The point of CV is usually model selection, e.g. hyper-parameter optimization. When we talk about variance here, we usually mean the variance of the prediction with respect to the training set changing, not with respect to the error estimates for a given model. Empirically, LOOCV appears not to be ideal for this. (It is also too costly to use for deep learners in practice.)

The intuition is this: given a model, evaluate it with LOOCV. You get a set of error estimates that you can average, and these estimates might be very close to each other (i.e. low variance across folds). Now completely change the dataset and do it again, so that for dataset $i$ you compute $E_i = \frac{1}{n}\sum_j e_{ij}$. The idea is that the variance of the $E_i$'s will be high for LOOCV, i.e. $\mathbb{V}[E]$ will be high, because within one dataset the training data are nearly the same in every fold: the $e_{ij}$'s are highly correlated for fixed $i$, which decreases the variance of the estimate of $E_i$ across folds but increases it for an estimator over datasets, e.g. written $E = \frac{1}{|D|}\sum_d E_d$ for a set of datasets $D$.

When using e.g. 2-fold CV, by contrast, one gets a good estimate of the generalization error because the two models tested are totally unrelated in terms of training data: they have high "variance" between folds, since their results will differ, but low variance across datasets, since they are trained on uncorrelated data and together are more representative of the generalization error.
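As a sketch of how one could check this empirically: a minimal simulation assuming scikit-learn, where the linear model, synthetic data-generating process, and sample sizes are illustrative choices of mine, and whether LOOCV actually comes out with higher $\mathbb{V}[E]$ depends on the model and data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n, n_datasets = 30, 200
loo_estimates, two_fold_estimates = [], []

for _ in range(n_datasets):
    # Draw a fresh dataset from the same population each time.
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

    model = LinearRegression()
    # E_i for this dataset: the per-fold MSEs averaged under each scheme.
    loo = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
    kf = -cross_val_score(model, X, y,
                          cv=KFold(n_splits=2, shuffle=True, random_state=0),
                          scoring="neg_mean_squared_error").mean()
    loo_estimates.append(loo)
    two_fold_estimates.append(kf)

# The quantity at issue: V[E], the variance of the E_i's across datasets.
print("var of E_i across datasets, LOOCV :", np.var(loo_estimates))
print("var of E_i across datasets, 2-fold:", np.var(two_fold_estimates))
```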

So, using LOOCV, we get a set of estimates $E_i$ that are stable across folds (due to the high training-sample correlation between models) but unstable across datasets; using e.g. 2-fold CV, on the other hand, gives instability across folds (since the data seen by the model in each case are independent) but less variance across datasets, since each estimate is already a better estimate of the true generalization error.

Note also that when CV uses large training sets (as in LOOCV), overfitting to the data is much easier than with e.g. 2-fold CV, since, if you are comparing models, the model that fits the dataset's idiosyncrasies will do better.
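A hedged sketch of that selection effect (again assuming scikit-learn; the polynomial degrees, noise level, and sample size are arbitrary choices of mine): with nearly the full dataset in every training fold, a flexible model that fits the dataset's idiosyncrasies can look relatively better under LOOCV than under 2-fold CV.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(25, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=25)

# Compare candidate models of increasing flexibility under both schemes.
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    for name, cv in (("LOOCV ", LeaveOneOut()),
                     ("2-fold", KFold(n_splits=2, shuffle=True, random_state=1))):
        mse = -cross_val_score(model, X, y, cv=cv,
                               scoring="neg_mean_squared_error").mean()
        print(f"degree={degree:2d}  {name}  CV-MSE={mse:.3f}")
```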

Note that this parallels the classic bias-variance tradeoff: weaker models can be estimated more accurately in terms of performance but cannot fully fit the data, and hence are biased estimators of the target function, whereas stronger models are more prone to overfitting, and hence have less bias in reaching the correct answer but more variance as the training set changes, since they will fit the noise.

Regarding the estimate itself, you are right that it's not clear how the effects balance out: the high correlation of the training data vs. the number of samples vs. the fact that each "test set" has only a single sample (and is thus a high-variance estimator too). For a single dataset, the variance of the estimator across folds will be smaller for LOOCV, but I don't think that is what we are interested in.

Just to link to some good posts on this topic: [1], [2], [3], [4], [5].