It makes sense to me that a randomly sampled subset of a dataset should still be theoretically representative of its parent. When you split data into training and test sets, you assume that both sets are representative of the dataset as a whole, at least in terms of its target distribution in the case of ML. But is there some minimum sample size below which a subset ceases to be representative?
Say I have a dataset $D$ consisting of $(x_i, y_i)$ pairs of input and target, and I take a subset $d \subset D$. How many points does $d$ need before its target distribution mirrors that of $D$?
To clarify what I mean by representative (as requested in a comment):
I claim that a subset $d$ is representative of $D$ if, for each label, the ratio of its count to the dataset size is preserved from $D$ to $d$.
For example, the ratio of label $a$'s frequency in $D$ to the size of $D$, which I'll denote $\tilde D_a$, is
$$\tilde D_a = \frac{f_a}{\lvert D \rvert}$$
where $f_a$ is the frequency of label $a$ in $D$. I argue that $d$ is representative of $D$ if:
$$\tilde d_i = \tilde D_i \ \forall i \in l$$
where $l$ is the set of labels of the target variable in $D$.
In theory, this means that if I plotted a histogram of the target/class distribution of $d$, it would have the same shape as that of $D$.
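A minimal sketch of this check in Python (all names here, like `label_ratios` and the synthetic label set, are my own illustration, not from any particular library): it computes each label's count-to-size ratio for a dataset and a random subset, then measures how far the subset's ratios deviate from the parent's.

```python
import random
from collections import Counter

def label_ratios(labels):
    """Return each label's frequency divided by the dataset size,
    i.e. the tilde-D_a quantity described above."""
    counts = Counter(labels)
    n = len(labels)
    return {label: c / n for label, c in counts.items()}

# Synthetic, imbalanced target distribution standing in for D's labels
random.seed(0)
D_labels = random.choices(["a", "b", "c"], weights=[0.6, 0.3, 0.1], k=1000)

# A uniformly random subset d of D
d_labels = random.sample(D_labels, 200)

ratios_D = label_ratios(D_labels)
ratios_d = label_ratios(d_labels)

# Largest per-label deviation between the two distributions;
# exact equality is rare for a finite random sample, so in practice
# one checks that this deviation is small
max_dev = max(abs(ratios_D[lbl] - ratios_d.get(lbl, 0.0)) for lbl in ratios_D)
print("D ratios:", ratios_D)
print("d ratios:", ratios_d)
print("max deviation:", max_dev)
```

Note that for a finite random subset the ratios will rarely match exactly, which is why the sketch reports a maximum deviation rather than testing strict equality.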