Optimal way to choose a "prediction" for a random variable


Let $X$ be a random variable with distribution

$$\mathbb{P}(X=k) = p_k, \quad k=1,2,\dots,n$$ $$(\text{and } \mathbb{P}(X=x) = 0 \text{ otherwise}).$$

We must choose a multiset $A$ of $m$ values $a_j\in\{1,2,\dots, n\}$, $j=1,2,\dots, m$. We then sample $X$ repeatedly, obtaining outcomes $x_1, x_2, \dots$. Each time an outcome matches an element of $A$, we remove one instance of that value from $A$. Note that a value can appear in $A$ multiple times, but only one instance is removed per matching draw. The aim is to have all elements removed.

What is the optimal way to choose $A$? By optimal I mean that we want to see (and remove) all the elements of $A$ in as few samplings as possible. (I'm not sure which quantity should be minimized here — the expected number of samplings, perhaps?)
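One natural candidate objective (my suggestion, not part of the problem statement) is the expected number of samplings $\mathbb{E}[T]$, where $T$ is the number of draws until $A$ is empty. Since $T$ is a nonnegative integer-valued random variable, the tail-sum formula applies:

$$\mathbb{E}[T] = \sum_{n\ge 0} \mathbb{P}(T>n),$$

and $\mathbb{P}(T>n)$ can be computed by inclusion–exclusion over the events "after $n$ draws, value $k$ has appeared fewer times than its multiplicity in $A$".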

Is the best strategy to take roughly $mp_k$ copies of each value $k$ (rounded to a natural number), breaking ties in favor of the larger $p_k$'s? For example, if $X$ takes the values $[1,2]$ with probabilities $[0.75, 0.25]$ and $m=4$, the multiset $A=\{1,1,1,2\}$ gives an expected number of samplings of about $5.7$, and this seems to be the best.
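As a sanity check on the $5.7$ figure, here is a small Monte Carlo sketch (the function name and trial count are my own choices, not from the question) that estimates the expected number of draws needed to empty a given multiset $A$:

```python
import random
from collections import Counter

def expected_samples(A, p, trials=200_000, seed=0):
    """Monte Carlo estimate of the expected number of draws of X
    needed to remove every element of the multiset A.
    p[k-1] is the probability that X = k."""
    rng = random.Random(seed)
    values = list(range(1, len(p) + 1))
    total = 0
    for _ in range(trials):
        need = Counter(A)            # copies of each value still to remove
        remaining = sum(need.values())
        draws = 0
        while remaining:
            x = rng.choices(values, weights=p)[0]
            draws += 1
            if need[x] > 0:          # one instance removed per matching draw
                need[x] -= 1
                remaining -= 1
        total += draws
    return total / trials

print(expected_samples([1, 1, 1, 2], [0.75, 0.25]))
```

By my own inclusion–exclusion computation, the exact value for this example is $91/16 = 5.6875$, which is consistent with the $5.7$ quoted above; the simulation should land close to that.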