Does anyone understand the paragraph below?
The paragraph comes from the cross-validation article on Wikipedia.
"It can be shown under mild assumptions that the expected value of the MSE for the training set is (n − p − 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets)."
Thanks in advance.
When you fit a model on a training set, a simple measure of model fit is MSE, which is basically the average squared distance between the observed y values and the predicted y values from your model. It turns out that MSE is an optimistic measure of the predictive accuracy of your model if you were to apply the model to a brand new dataset on which the model was not trained. The general purpose of cross-validation is to estimate the accuracy of your model when applied to a new dataset.
However, in the case of linear regression, it turns out that the factor by which the training MSE understates the expected MSE on a new dataset is known: the expected training MSE equals
$\dfrac{n-p-1}{n+p+1}$ times the expected validation MSE, where n is the number of observations in the training dataset and p is the number of explanatory variables (the intercept brings the number of estimated coefficients to p + 1). Rearranging, linear regression gives the estimate
$ MSE_{validation}=\dfrac{n+p+1}{n-p-1} MSE_{train}$.
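One way to check this relationship is a small simulation (a sketch under one reading of the Wikipedia statement: a fixed design matrix with fresh responses drawn at the same design points, and p counting the slope coefficients, the intercept adding one more estimated parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 30, 4, 1.0  # n observations, p slopes, noise sd

# Fixed design: intercept column plus p random covariates
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = rng.normal(size=p + 1)

mse_train, mse_valid = [], []
for _ in range(20000):
    # Draw a training set, fit OLS, record in-sample MSE
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse_train.append(np.mean((y - X @ beta_hat) ** 2))

    # Fresh responses at the same design points play the role
    # of a validation set
    y_new = X @ beta + sigma * rng.normal(size=n)
    mse_valid.append(np.mean((y_new - X @ beta_hat) ** 2))

ratio_sim = np.mean(mse_train) / np.mean(mse_valid)
ratio_theory = (n - p - 1) / (n + p + 1)
print(f"simulated ratio:   {ratio_sim:.3f}")
print(f"theoretical ratio: {ratio_theory:.3f}")
```

With n = 30 and p = 4 the theoretical ratio is 25/35 ≈ 0.714, and the simulated ratio of average training MSE to average validation MSE should land close to that.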
Therefore, in ordinary linear regression this formula already estimates how the model would perform on a validation dataset, so there is no need to use cross-validation for that purpose.