Do we factor in sample size in choice of number of folds in cross-validation?


My understanding is that there is no formal theory/rule on how to choose the number of folds ($k$) in $k$-fold cross-validation. I have read that a good choice is to hold out roughly 15-20 percent of the data for validation in each run (i.e. one should generally use 5-fold or 10-fold cross-validation). But does the sample size in some way affect/determine how to choose an appropriate $k$?


1 Answer


It may be better to post this on Data Science or Cross Validated.

But it is difficult to state a rule that works for every ML model, as the answer really depends on the type of estimator as well as the sample size. With $k$-fold cross-validation, the model is trained on $k-1$ of the folds and validated on the remaining fold, rotating through all $k$ folds. Some ML algorithms are robust to smaller training sets, while others (e.g. deep learning) require large amounts of data to converge.
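The rotation described above can be sketched without any ML library; this is a minimal, hypothetical illustration of the index bookkeeping only (the `kfold_indices` helper is mine, not from any particular package):

```python
# Minimal sketch of k-fold cross-validation mechanics in plain Python:
# split the sample indices into k folds, then for each fold use it as the
# validation set and the other k-1 folds as the training set.

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k-fold CV over n samples."""
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]          # the held-out fold
        train = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train, val
        start += size

# Each of the k runs validates on a distinct fold and trains on the rest.
splits = list(kfold_indices(10, 5))
```

Every sample appears in exactly one validation fold across the $k$ runs, which is what makes the averaged validation score an estimate over the whole dataset.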

As a rough rule of thumb, choose $k$ large enough that each run's training set, roughly $\frac{k-1}{k}N$ points, is no smaller than the amount of data the ML algorithm needs to converge and achieve the desired performance. Setting $k=2$, for instance, trains the model on only 50% of the data in each run before averaging. Setting $k=N$, where $N$ is the sample size, is equivalent to training on the full dataset minus a single point (leave-one-out cross-validation). This helps when large amounts of data are necessary for convergence, but it requires $N$ training runs, so there is a computational trade-off.
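To make the training-size side of that trade-off concrete, here is a small sketch (the function name and the example $N$ are my own choices) computing how many points each run actually trains on for a few values of $k$:

```python
# Training-set size per CV run as a function of k: each run holds out one
# fold of about N/k points and trains on the rest, i.e. roughly (k-1)/k of
# the data. k=2 trains on half; k=N (leave-one-out) trains on all but one.

def train_size_per_run(n_samples, k):
    """Approximate number of training points in each of the k runs."""
    return n_samples - n_samples // k  # one fold of ~n/k points is held out

N = 1000  # illustrative sample size
sizes = {k: train_size_per_run(N, k) for k in (2, 5, 10, N)}
# k=2 -> 500 training points, k=5 -> 800, k=10 -> 900, k=N -> 999
```

The jump from $k=2$ to $k=5$ buys a lot of extra training data per run; past $k=10$ the gains shrink while the number of training runs keeps growing.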

Now that you have bounds for the appropriate $k\in[2,N]$, you could first establish a baseline performance metric without any cross-validation, then either empirically repeat the training process over candidate values of $k$, or theoretically tighten the bounds by considering the data requirements of the ML algorithm.
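The empirical sweep could look something like the following sketch. Everything here is hypothetical scaffolding: the synthetic data, the deliberately trivial slope-estimator "model" (`fit`), and the candidate set of $k$ values all stand in for whatever estimator and data you actually have.

```python
# Hedged sketch: compare a no-CV baseline against cross-validated scores
# for several candidate fold counts k, using a trivial stand-in model.
import random

random.seed(0)
# Synthetic data: y = 2x plus Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in range(100)]

def fit(train):
    """'Train' the trivial model: estimate the slope from the mean of y/x."""
    pairs = [(x, y) for x, y in train if x != 0]
    return sum(y / x for x, y in pairs) / len(pairs)

def mse(slope, points):
    """Mean squared error of the fitted slope on a set of points."""
    return sum((y - slope * x) ** 2 for x, y in points) / len(points)

def cv_score(data, k):
    """Average validation MSE over k contiguous folds."""
    n = len(data)
    scores = []
    for i in range(k):
        val = data[i * n // k:(i + 1) * n // k]
        train = data[:i * n // k] + data[(i + 1) * n // k:]
        scores.append(mse(fit(train), val))
    return sum(scores) / k

baseline = mse(fit(data), data)  # no cross-validation: fit and score on everything
results = {k: cv_score(data, k) for k in (2, 5, 10, len(data))}
```

Plotting or tabulating `results` against `baseline` shows where the score stops improving as $k$ grows, which is a practical stopping point for the sweep.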