If I have a dataset of 1300 samples, is it a fair proportion to split it into 1100 for training and 200 for validation? Will I incur some sort of bias if I reduce the validation set even further? (I've noticed that with an 1120/180 split I get better accuracy on the validation set, but the validation set shrinks to just under 14% of the total!)
Training and validation set
The size of the validation set is pretty much up to you.
It is a tradeoff: the more samples you hold out for validation, the lower the variance of your estimate of generalization performance, but the fewer samples remain for training, which tends to worsen the model itself.
So, overall, the takeaway is: as you shrink the validation set, you likely increase the variance of the performance estimate, but not its bias (the bias grows as the *training* set shrinks instead). Only you can decide where the balance lies (no one here knows whether your data consists of 10-dimensional vectors or graphs with thousands of nodes each), but with only 1300 samples you could use 10-fold cross-validation, or even more folds. Even with only 50-100 validation points you can get a reasonable estimate (though of course this depends on the situation).
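As a rough sketch of why cross-validation sidesteps the split-size dilemma: with 10 folds on 1300 samples, every fold serves as a 130-sample validation set exactly once, while each sample still contributes to training in the other 9 folds. A minimal stdlib-only illustration (the helper names `kfold_indices` and `kfold_splits` are hypothetical, not from any particular library):

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_splits(n, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    folds = kfold_indices(n, k, seed)
    for i in range(k):
        val = folds[i]
        # Training set = all samples not in the held-out fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

# With 1300 samples and 10 folds: each validation fold has 130 samples,
# and each training set has the remaining 1170.
sizes = [(len(train), len(val)) for train, val in kfold_splits(1300, 10)]
```

You would then train and evaluate your model once per split and average the 10 validation scores, which uses every sample for both training and validation without committing to a single fixed split.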
Note that the validation set is distinct from the test set, but the same tradeoff generally applies to the test set as well.