k-fold cross validation


I have a question regarding k-fold cross validation. I understand the process in general, but I am not certain why we test on all data except Sj (i.e. all except one fold). My understanding is that a single subsample is kept as validation data for testing the model, and the remaining k − 1 subsamples are used as training data.

Why do we need to keep one subsample as validation data? Maybe the real question is: why can't we use that subsample as training data and still use it as validation data? Isn't it unchanged even though we use it for training?

Thanks for any clarification.

There are 2 answers below.


The main principle is that a model's ability to generalize is tested on previously unseen data. It does not make sense to test a model on data that has been used during training. Including a test fold in the training set makes a great difference, because the model adjusts its parameters to minimize the difference between its predictions and the ground truth on exactly that data.
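As a toy illustration of this principle (the data and model here are assumptions, not from the question): a 1-nearest-neighbour "memorizer" scores perfectly on its own training set, so evaluating it on training data tells us nothing about how it handles new points.

```python
# Toy example: why training accuracy is not a measure of generalization.
# A 1-nearest-neighbour model simply memorizes its training points.

def nn_predict(train, x):
    # Return the label of the training point closest to x.
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Hypothetical 1-D training set: (feature, label) pairs.
train = [(0.0, "a"), (1.0, "b"), (2.0, "a"), (3.0, "b")]

# Every training point's nearest neighbour is itself, so training
# accuracy is trivially perfect -- regardless of how the model would
# behave on unseen data.
train_acc = sum(nn_predict(train, x) == y for x, y in train) / len(train)
```

Here `train_acc` is 1.0 by construction, which is exactly why a held-out fold is needed to judge the model.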

In k-fold cross validation the data set is split into k folds and k experiments are performed. Each fold in turn is used as the test set, with the other k − 1 folds as the training set.
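The splitting scheme above can be sketched in plain Python (a minimal, hypothetical helper, not any particular library's API):

```python
# Minimal sketch of k-fold splitting: each fold serves once as the
# test set while the remaining k-1 folds form the training set.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    folds = k_fold_indices(n, k)
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

With 10 samples and 5 folds, each sample appears in the test set exactly once and in the training set in the other 4 experiments.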


In the train/cross-validate/test paradigm, your aim is to choose and tune your best model and then train it; one of your objectives is to avoid over-fitting.

You start by separating a test set from the data, which you will use once at the very end to judge the performance of your model on unseen data.

With the remaining data (the training data), you can choose your model and tune its hyperparameters using cross-validation. With $k$-fold cross validation, you run through possible models and hyperparameters $k$ times each, training on $\frac{k-1}{k}$ of the training data and validating on the $\frac1k$ held out in a particular fold; combining these results indicates which model and hyperparameters may perform best. Once you have done that, you have a chosen model and hyperparameters, and you can train it on the full set of training data to produce your final model.
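A toy sketch of that selection loop, with two hypothetical "models" (a predict-the-mean and a predict-the-median estimator on a 1-D target) standing in for real candidate models or hyperparameter settings:

```python
# Model selection via k-fold CV: score each candidate by its average
# validation error across the k folds, pick the best, then retrain
# the winner on ALL the training data. (Toy models and data, assumed.)

def fit_predictor(kind, train_y):
    if kind == "mean":
        return sum(train_y) / len(train_y)
    srt = sorted(train_y)
    return srt[len(srt) // 2]  # upper median for even-length lists

def mse(pred, ys):
    return sum((y - pred) ** 2 for y in ys) / len(ys)

def cv_error(kind, ys, k):
    n = len(ys)
    folds = [list(range(i, n, k)) for i in range(k)]  # simple striped folds
    errors = []
    for i in range(k):
        val = [ys[j] for j in folds[i]]
        train = [ys[j] for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(mse(fit_predictor(kind, train), val))
    return sum(errors) / k  # combine the k fold results

train_y = [1.0, 2.0, 2.0, 3.0, 10.0, 2.0, 1.0, 3.0]  # contains one outlier
best = min(["mean", "median"], key=lambda kind: cv_error(kind, train_y, 4))
final_model = fit_predictor(best, train_y)  # retrain on ALL training data
```

The last line is the step described above: once cross-validation has picked the candidate, the final fit uses the full training set, not just $\frac{k-1}{k}$ of it.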

So your statement "we test on all data except Sj" does not look correct: instead, in the validation phase we train on all folds except one, then do it again on all folds except another one, and so on for all $k$ folds. This means that at this stage we do reuse the validation data as training data in other folds. There is still a real risk of some overfitting, because the same data is used multiple times and decisions about the model are based on those repeated uses.

In the final testing phase, we train on all the training data and then test the final model once on the initially held-out test data to judge performance. The test data is not used for training and the model is not revised based on the results of the testing, since that would bring back the risk of overfitting and we would no longer be able to judge whether that was happening.
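This final phase might look like the following minimal sketch, where `fit` and `evaluate` are assumed stand-ins (a predict-the-mean model scored with squared error), not any real library's API:

```python
# Final phase: retrain the chosen model on all the training data,
# then score it exactly once on the initially held-out test set.

def fit(ys):
    # "Train" the toy model: here, just the mean of the training targets.
    return sum(ys) / len(ys)

def evaluate(model, ys):
    # Mean squared error of the constant prediction on held-out targets.
    return sum((y - model) ** 2 for y in ys) / len(ys)

targets = [2.0, 4.0, 6.0, 8.0, 10.0, 3.0]     # toy labelled data, assumed
test_y, train_y = targets[:2], targets[2:]    # test set held out up front

final_model = fit(train_y)                    # fit on ALL the training data
final_score = evaluate(final_model, test_y)   # used once, never to revise
```

Crucially, `final_score` is reported as-is; going back and adjusting the model after seeing it would reintroduce the overfitting risk described above.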