It makes sense to me that a randomly sampled subset of a dataset should still be theoretically representative of its parent. When you split data into training and test sets, you assume that both sets are representative of the dataset as a whole, at least in terms of its target distribution in the case of ML. But is there some minimum sample size below which a subset ceases to be representative?
Say I have a dataset $D$ consisting of $(x_i, y_i)$ pairs of input and target, and I take a subset $d \subset D$. How many points does $d$ need before its target distribution mirrors that of $D$?
To clarify what I mean by representative (as requested in a comment):
I claim that a subset $d$ is representative of $D$ if, for each label, the ratio of its count to the dataset size is preserved from $D$ to $d$.
For example, the ratio of label $a$'s frequency in $D$ to the size of $D$, which I'll denote $\tilde D_a$, is
$$\tilde D_a = \frac{f_a}{\lvert D \rvert}$$
where $f_a$ is the frequency of label $a$ in $D$. I argue that $d$ is representative of $D$ if:
$$\tilde d_i = \tilde D_i \ \forall i \in l$$
where $l$ is the set of labels of the target variable in $D$.
In theory, this means that if I plotted a histogram of the target/class distribution of $d$, it would have the same shape as that of $D$.
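A minimal sketch of this check in Python (all names here, like `label_ratios` and the synthetic label set, are my own illustration, not from any particular library): it computes each label's count-to-size ratio for a dataset and a random subset, then measures how far the subset's ratios deviate from the parent's.

```python
import random
from collections import Counter

def label_ratios(labels):
    """Return each label's frequency divided by the dataset size,
    i.e. the tilde-D_a quantity described above."""
    counts = Counter(labels)
    n = len(labels)
    return {label: c / n for label, c in counts.items()}

# Synthetic, imbalanced target distribution standing in for D's labels
random.seed(0)
D_labels = random.choices(["a", "b", "c"], weights=[0.6, 0.3, 0.1], k=1000)

# A uniformly random subset d of D
d_labels = random.sample(D_labels, 200)

ratios_D = label_ratios(D_labels)
ratios_d = label_ratios(d_labels)

# Largest per-label deviation between the two distributions;
# exact equality is rare for a finite random sample, so in practice
# one checks that this deviation is small
max_dev = max(abs(ratios_D[lbl] - ratios_d.get(lbl, 0.0)) for lbl in ratios_D)
print("D ratios:", ratios_D)
print("d ratios:", ratios_d)
print("max deviation:", max_dev)
```

Note that for a finite random subset the ratios will rarely match exactly, which is why the sketch reports a maximum deviation rather than testing strict equality.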