I have clusters of a given length (minimum length = 3). I would like to distribute the points of each cluster across three sets according to given percentages (e.g. train = 70%, test = 20%, validation = 10%). Each set should get at least one data point, and the three resulting sizes should match the stated percentages as closely as possible.
So, for instance:
cluster length = 3: train: 1 test: 1 validation: 1
cluster length = 7: train: 4 (floor) test: 2 validation: 1
cluster length = 10: train: 7 test: 2 validation: 1
I was wondering if there is a smooth way of doing it.
Since the validation set is the smallest, start by determining its size. If you have $n$ points, then $$|\text{Validation set}| = \max\{1, \text{int}(0.1\cdot n)\}$$ where $\text{int}(x)$ is $x$ rounded to the nearest integer. Next, using the 20% share stated for the test set, $$|\text{Test set}| = \max\{1,\text{int}(0.2\cdot n)\}$$ and finally $$|\text{Training set}| = n-|\text{Validation set}|-|\text{Test set}|.$$ Since $n \ge 3$, the training set is never empty: the other two sets together take at most $\max\{2,\, 0.3\cdot n + 1\}$ points, which is less than $n$.
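A minimal Python sketch of this rule, in case it helps. The function name `split_sizes` and its parameters are made up for illustration, and the default fractions assume the 10% validation and 20% test shares from the question. Ties are rounded up with `floor(x + 0.5)`, because Python's built-in `round` rounds halves to the nearest even integer; that tie rule is my assumption, not something the formula above fixes.

```python
import math


def split_sizes(n, val_frac=0.1, test_frac=0.2):
    """Return (train, test, validation) sizes for a cluster of n points.

    Hypothetical helper: the two smaller sets are sized first, each with
    at least one point, and the training set takes whatever is left.
    """
    if n < 3:
        raise ValueError("need at least 3 points to fill three sets")

    def nearest_int(x):
        # Round to the nearest integer, with halves rounded up (assumption).
        return math.floor(x + 0.5)

    validation = max(1, nearest_int(val_frac * n))
    test = max(1, nearest_int(test_frac * n))
    train = n - validation - test
    return train, test, validation


for n in (3, 7, 10):
    print(n, split_sizes(n))
```

For cluster lengths 3 and 10 this reproduces the splits from the question (1/1/1 and 7/2/1). For length 7 it gives 5/1/1, since 20% of 7 rounds to 1 rather than 2.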