Bound on the amount of probability mass 'seen' from a sample.

36 Views Asked by At

I was thinking of the following puzzle out of the blue and I would like to know if it has already been studied (I would assume so due to the simplicity of the setup). Imagine you have some unknown discrete random variable $X$ with a support over $v_1,...,v_k$, you draw an i.i.d sample $S$ consisting of $X_1,...,X_m$; you would like to lower bound the amount of probability mass 'seen' (which we will denote by $L_{S}$) in your sample. Formally, $L_S = \mathbb{P}(\cup_{i=1}^m\{X = X_i\}) = \sum_{i\in [k]}\mathbb{P}(X=v_i)\mathbb{1}_{v_i \in S}$

E.G. imagine $X_i \sim \text{Bern}(p)$ then there are 3 cases for $m\geq 1$

  1. $X_i = 1 \ \forall i \in [m] \implies L_S = p\ $ (since $\mathbb{P}(X = 1) = p$)
  2. $X_i = 0 \ \forall i \in [m] \implies L_S = 1-p\ $ ($\mathbb{P}(X = 0) = 1-p$)
  3. $\exists i, j \in [m] : X_i = 0 \text{ and } X_j = 1 \implies L_S = 1\ $ ($\mathbb{P}(X = 1 \cup X = 0) = 1$)

We can calculate the probability of every such case without much effort and then derive a simple bound through markov's inequality that shows: $$ \mathbb{P}_{S\sim X^m}(L_S \geq \epsilon) \leq \frac{\mathbb{E}[L_S]}{\epsilon} $$

I would like to know the name of such a quantity, if it has been studied before, and in what contexts. I would ultimately like a PAC-style (not necessarily distribution-independent) bound of the form $$ \mathbb{P}(L_S\geq \epsilon) \geq 1-\delta $$ where the constants depend on $k$ and $m$