Training, validation, and test dataset and i.i.d. assumption


I wonder whether we should distinguish the validation and test datasets based on the i.i.d. assumption.

According to statistical learning theory, the i.i.d. assumption is required for guarantees on generalization performance (the performance measured on the test set) when we can observe only the training set. Strictly speaking, we cannot observe the test set while we train a model (right?). In practice, however, we simply split a single dataset into training, validation, and test sets. I think splitting a dataset in this way violates the independence of the data-generating process.
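To make the practice I am questioning concrete, here is a minimal sketch of the usual random split (the data, seed, and 60/20/20 proportions are just illustrative assumptions, not from any particular source):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))  # pretend these are i.i.d. draws from the process

# Shuffle once, then cut into training / validation / test (60% / 20% / 20%)
idx = rng.permutation(len(data))
train, val, test = np.split(data[idx], [600, 800])
```

All three subsets come from the same shuffled array, which is exactly why I wonder whether they can be treated as independent samples from the generating distribution.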

For example, consider a data-generating process based on Latin-hypercube sampling. [Figure: left panel — the distribution of a dataset sampled by the Latin-hypercube method; center panel — the random split of that dataset into training and validation sets; right panel — a test set sampled by the same process as the training set.] Even though the difference between the training and test sets (or between the full dataset and the test set) looks insignificant, they are not identical. The sampling and splitting steps are dependent, so the resulting distributions are not identical to that of the test set.
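The dependence within a Latin-hypercube sample can be checked directly. Below is a sketch using SciPy's `scipy.stats.qmc.LatinHypercube` on the 2-D unit square (the sample size, seed, and 50/50 split are my own illustrative assumptions): stratification forces exactly one point per axis bin, and a random split inherits that constraint rather than producing an independent sample.

```python
import numpy as np
from scipy.stats import qmc

n = 100
sampler = qmc.LatinHypercube(d=2, seed=0)
sample = sampler.random(n)  # n points in [0, 1)^2

# LHS stratification: each axis is cut into n equal bins, and every bin
# contains exactly one point -- so the n draws are not independent.
bins_x = np.floor(sample[:, 0] * n).astype(int)
counts = np.bincount(bins_x, minlength=n)

# A random 50/50 split inherits this dependence: the training half covers
# exactly 50 of the 100 bins, something an i.i.d. sample would not guarantee.
rng = np.random.default_rng(1)
idx = rng.permutation(n)
train = sample[idx[: n // 2]]
train_bins = np.unique(np.floor(train[:, 0] * n).astype(int))
print(counts.max(), counts.min(), len(train_bins))  # 1 1 50
```

This is what I mean by the split violating independence: the location of one training point constrains where every other point (including the held-out ones) can be.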

I agree that splitting datasets is practically useful. But I'm curious: is there any theoretical background supporting this practice?