Training and validation set

If I have a dataset of 1300 samples, is it reasonable to split it into 1100 for training and 200 for validation? Will I introduce some sort of bias if I shrink the validation set even further? (I've noticed that with 1120/180 I get better accuracy on the validation set, but the validation share of the total shrinks to about 14%!)

Best answer

The size of the validation set is pretty much up to you.

It is a tradeoff:

  • As the validation set shrinks, the classifier you train gets closer to the "final" classifier you would get from the whole dataset, but its performance is measured on only a small number of validation points. The resulting estimate therefore has high variance (measured across different potential datasets [see links below]). For instance, in $k$-fold CV, if $k$ is very high (e.g. leave-one-out CV), the training sets overlap almost completely, so the per-fold estimates are strongly correlated and averaging them reduces the variance less than you might expect. This is akin to the fact that the variance of an estimator built from correlated data decreases more slowly as the number of points grows than the variance of one built from independent data.
  • On the other hand, if the validation set is huge, the classifier you evaluate is trained on much less data than the one you actually want to use on the test set. In other words, you get a biased (typically pessimistic) estimate of the generalization error!
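To put rough numbers on the variance side of the tradeoff, here is a minimal sketch that treats each validation prediction as an independent coin flip, so the measured accuracy has a binomial standard error. The function name `accuracy_standard_error` and the assumed true accuracy of 0.80 are mine, chosen purely for illustration; they do not come from the question.

```python
import math


def accuracy_standard_error(accuracy: float, n_val: int) -> float:
    """Binomial standard error of an accuracy measured on n_val independent points."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_val)


# Assumed "true" accuracy, picked only for illustration.
assumed_accuracy = 0.80

for n_val in (50, 100, 180, 200, 400):
    se = accuracy_standard_error(assumed_accuracy, n_val)
    # Rough 95% half-width under a normal approximation.
    print(f"n_val={n_val:4d}  std. error={se:.3f}  95% half-width={1.96 * se:.3f}")
```

Under these assumptions the standard error is about 0.028 at 200 validation points and about 0.030 at 180, so an accuracy difference of a point or two between a 1100/200 and a 1120/180 split is well within the noise of a single split.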

So, overall, the takeaway is that as you reduce the validation set, you are likely increasing the variance of the estimate but not its bias (the bias increases as the training set shrinks). Only you can really decide where the balance lies, since no one here knows whether your data consists of 10-dimensional vectors or graphs with thousands of nodes each. With only 1300 samples, though, you could use 10-fold CV, or an even higher $k$. Even with only 50-100 validation points you will usually get a reasonable estimate (though of course this depends on the situation).
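As a rough illustration of what 10-fold CV looks like on 1300 samples, here is a standard-library-only sketch. The data is synthetic (two overlapping 1-D Gaussian classes) and the "classifier" is a simple midpoint threshold; both are stand-ins I made up for the example, as are all the helper names, so substitute your own data and model.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def make_toy_data(n_samples, seed=0):
    """Synthetic 1-D data: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data


def fit_threshold(train):
    """Toy classifier: threshold at the midpoint between the two class means."""
    mean0 = mean(x for x, y in train if y == 0)
    mean1 = mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, val):
    """Fraction of points where 'x above threshold' agrees with label 1."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in val)


data = make_toy_data(1300)
fold_scores = []
for train_idx, val_idx in k_fold_indices(len(data), k=10):
    threshold = fit_threshold([data[i] for i in train_idx])
    fold_scores.append(accuracy(threshold, [data[i] for i in val_idx]))

print(f"mean accuracy over folds: {mean(fold_scores):.3f}")
print(f"spread across folds (std): {stdev(fold_scores):.3f}")
```

Each fold trains on 1170 points and validates on 130, so every sample is used for validation exactly once. The spread across folds gives a feel for how noisy any single small validation set would be.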

Note that the validation set serves a different purpose from the test set (it is used to tune and compare models, whereas the test set is held back for the final evaluation), but the same size tradeoff generally applies to the test set as well.

See this one, this one, this one, this one and this one, as well as their associated links for more info.