My question is based in genetics, but let's use beetles to make it more concrete.
There are 250,000 known species of beetles. Let's assume that a thorough survey of beetle species has been performed, and a catalog of the frequency of each species has been created. I would like to create a beetle repository with a single representative of each species. As there is time and cost required to collect and store these beetles, I am trying to optimize my beetle collection size.
To do this, I need to know how many individuals I must randomly collect to reach various levels of species representation, assuming that duplicates don't add to the representation. (I.e., if I collect 1000 individuals, I'll probably only get representation of a fraction of that, since some will be duplicates.)
Thoughts on a solution so far:

Let's index the species by $m$, with $1 \le m \le 250000$.
And define $p(m)$ as, conveniently, both the probability of collecting a given species $m$ and the percentage of the population it represents.
I want to define a function $r(n)$ to describe the expected value of the proportion of the population (of all species) represented in my repository, given $n$ samples.
If I collect 1 beetle, the calculation is pretty straightforward. I need to calculate the probability of getting each type of beetle times the proportion represented for that beetle. $$ r(1) = \sum_{m=1}^{250000}p(m)^2 $$
Things get more complicated as we increase the number of beetles we collect, however, because we need to calculate the probability of every single combination of species we might get.
$$ r(2) = \sum_{m_1=1}^{250000}\ \sum_{m_2=1}^{250000} p(m_1)p(m_2) \bigl( p(m_1)+p(m_2)\bigr) $$
$$ \vdots $$
$$ r(n) = \sum_{m_1=1}^{250000} \cdots \sum_{m_n=1}^{250000} \bigl(p(m_1) \cdots p(m_n)\bigr) \bigl( p(m_1)+\cdots+p(m_n)\bigr) $$
But wait, that's not right!
Those sums ignore the fact that any duplicates shouldn't be counted twice.
There has to be a better way to think about this. Any guidance is appreciated!
You write that you want to define a function $r(n)$ to describe the fraction of all species represented in your repository, but there seem to be two things wrong with that. The first is that that fraction is a random variable but you write equations for it as if it were a non-random quantity. It seems that you actually intend $r(n)$ to be the expected value of that fraction. The second inconsistency is that the equations you write aren't actually for the expected value of the fraction of all species in your repository, but (more in keeping with the question title) for the expected value of the proportion of the population that those species represent. So I'll assume in the following that you intended for $r(n)$ to be defined as the expected value of the proportion of the population that your repository represents.
This can be directly calculated using the linearity of expectation. Write $p_m$ for your $p(m)$ and $k$ for the sample size (your $n$). Assuming the draws are independent, the probability that species $m$ is missed by all $k$ draws is $(1-p_m)^k$, so the probability for species $m$ to be represented in a sample of $k$ beetles is $1-(1-p_m)^k$. If it's represented, it contributes $p_m$ to the proportion of the population that's represented, so by linearity of expectation the expected value of that proportion for a sample of $k$ beetles is
$$ \sum_m p_m\left(1-(1-p_m)^k\right)\;. $$