Recently, I thought of the following question: Suppose there are 5 candies in a bag. You draw two candies at once, note which ones they are, and then put both candies back in the bag (assume each candy has an equal probability of being selected). On average, how many draws do you need before you have seen every candy at least once?
This problem reminds me of the "Coupon Collector Problem" (https://en.wikipedia.org/wiki/Coupon_collector%27s_problem), but I am not sure how to fit it into the Coupon Collector framework, since each draw yields two candies instead of one. Instead, I thought of framing the problem as a Markov chain:
- State 2: You have observed 2 unique candies
- State 3: You have observed 3 unique candies
- State 4: You have observed 4 unique candies
- State 5: You have observed 5 unique candies (Absorbing State)
It took me a long time, but I think I was able to construct the transition matrix for this problem:
A = matrix(
  c(0.1, 0.6, 0.3, 0,
    0,   0.3, 0.6, 0.1,
    0,   0,   0.6, 0.4,
    0,   0,   0,   1),  # the data elements, row by row
  nrow = 4,              # number of rows
  ncol = 4,              # number of columns
  byrow = TRUE)
[,1] [,2] [,3] [,4]
[1,] 0.1 0.6 0.3 0.0
[2,] 0.0 0.3 0.6 0.1
[3,] 0.0 0.0 0.6 0.4
[4,] 0.0 0.0 0.0 1.0
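To sanity-check this matrix, I also carried out the standard absorbing-chain calculation numerically (I scripted it in Python rather than R, purely because I find it easy to do the linear algebra there). The sketch below solves $(I - Q)t = \mathbf{1}$ for the transient part $Q$; since the chain can never lose candies, $I - Q$ is upper triangular and plain back-substitution works:

```python
# Transient part Q of the transition matrix above (states 2, 3, 4);
# state 5 is absorbing and is dropped.
Q = [[0.1, 0.6, 0.3],
     [0.0, 0.3, 0.6],
     [0.0, 0.0, 0.6]]

n = len(Q)
# Solve (I - Q) t = 1 by back-substitution: t[i] is the expected number
# of transitions until absorption, starting from transient state i.
t = [0.0] * n
for i in range(n - 1, -1, -1):
    t[i] = (1 + sum(Q[i][j] * t[j] for j in range(i + 1, n))) / (1 - Q[i][i])

# The chain starts in state 2 (the first draw always reveals 2 new candies),
# so the expected total number of draws is 1 (the first draw) + t[0].
print(1 + t[0])  # about 5.33 draws on average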
From here, I suppose I could use the theory of absorbing Markov chains to find the expected number of transitions until the absorbing state is reached - but it was quite difficult to calculate the transition probabilities correctly, and I imagine that as the number of states (i.e. "candies") increases, it will become very tedious to calculate all of them by hand.
I was hoping for an easier way to directly calculate the expected number of draws "N" needed to observe all "M" candies at least once, with each draw of size "K" (e.g. M = 5, K = 2, N = ?) - ideally one that also works when you are given arbitrary selection probabilities for the candies (i.e. the candies need not be equally likely to be selected).
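For what it's worth, here is my own attempt at scripting the equal-probability case in general (the function name `expected_draws` is mine, not from any library). It builds the transition probabilities programmatically and exploits the fact that the number of seen candies never decreases, so the linear system is triangular:

```python
from math import comb

def expected_draws(M, K):
    """Expected number of draws of size K (drawn without replacement within
    a draw, then returned to the bag) until all M equally likely candies
    have been seen at least once."""
    # Probability of moving from i seen candies to j seen candies:
    # the j - i new candies out of K follow a hypergeometric distribution
    # with M - i "successes" (unseen candies) in a population of size M.
    def p(i, j):
        k = j - i
        if k < 0 or k > K:
            return 0.0
        return comb(M - i, k) * comb(i, K - k) / comb(M, K)

    # t[i] = expected number of further draws starting from state i.
    # The chain never moves backwards, so back-substitution suffices.
    t = [0.0] * (M + 1)  # t[M] = 0: absorbing state
    for i in range(M - 1, K - 1, -1):
        t[i] = (1 + sum(p(i, j) * t[j] for j in range(i + 1, M + 1))) / (1 - p(i, i))

    return 1 + t[K]  # the first draw always reveals K new candies

print(expected_draws(5, 2))  # about 5.33
print(expected_draws(5, 1))  # 137/12, the classic coupon-collector answer
```

With K = 1 this recovers the usual Coupon Collector expectation $M \sum_{i=1}^{M} 1/i$, which gives me some confidence in the construction, but I have not verified it beyond small cases.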
Can someone please suggest another way of solving this problem?
Thanks!
- "Food" for Thought: Suppose there were "M" candies and you draw "K" candies "N" times. Suppose this time you don't know the true value of "M" and you only have information on "K" and "N" - is there a way to estimate "M" based on the data you collect across the draws?
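One unverified idea I had for this is a method-of-moments estimate: under equal probabilities, a given candy misses a single draw of size $K$ with probability $(M-K)/M$, so after $N$ independent draws the expected number of unique candies seen is $E[U] = M\left(1 - (1 - K/M)^N\right)$. Inverting this numerically would give an estimate of $M$ from the observed number of unique candies $U$ (the function name `estimate_M` is my own; this is just a sketch):

```python
def estimate_M(U, K, N, tol=1e-9):
    """Method-of-moments estimate of M: solve U = M * (1 - (1 - K/M)^N)
    for M by bisection.  Requires U < N * K; otherwise every draw showed
    only new candies and M cannot be bounded from these counts alone."""
    if U >= N * K:
        raise ValueError("U must be smaller than N * K for a finite estimate")

    def expected_unique(M):
        return M * (1 - (1 - K / M) ** N)

    # expected_unique is increasing in M, so bracket the root and bisect.
    lo = hi = float(max(U, K))
    while expected_unique(hi) < U:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_unique(mid) < U:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For small $N$ this will be noisy, and a proper treatment would presumably use the full likelihood, but it at least shows that $M$ is identifiable from $U$, $K$, and $N$.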
The transition probabilities come from the hypergeometric distribution. You have a population of size $M$, and you draw $K$ candies without replacement. When you have seen $i$ candies (that is, you are in state $i$), the number of successes (unseen candies) in the population is $M-i$, and the probability of observing $k$ new candies is
$$\frac{{M-i\choose k}{i\choose K-k}}{M \choose K}$$
For $M=5$ you get (I include the states 0 and 1)
$$ \displaystyle \frac{1}{\binom{5}{K}} \left(\begin{array}{rrrrrr} \binom{0}{K} & 5 \, \binom{0}{K - 1} & 10 \, \binom{0}{K - 2} & 10 \, \binom{0}{K - 3} & 5 \, \binom{0}{K - 4} & \binom{0}{K - 5} \\ 0 & \binom{1}{K} & 4 \, \binom{1}{K - 1} & 6 \, \binom{1}{K - 2} & 4 \, \binom{1}{K - 3} & \binom{1}{K - 4} \\ 0 & 0 & \binom{2}{K} & 3 \, \binom{2}{K - 1} & 3 \, \binom{2}{K - 2} & \binom{2}{K - 3} \\ 0 & 0 & 0 & \binom{3}{K} & 2 \, \binom{3}{K - 1} & \binom{3}{K - 2} \\ 0 & 0 & 0 & 0 & \binom{4}{K} & \binom{4}{K - 1} \\ 0 & 0 & 0 & 0 & 0 & \binom{5}{K} \end{array}\right) $$
which agrees with your result if you plug in $K=2$:
$$ \displaystyle \left(\begin{array}{rrrrrr} 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & \frac{2}{5} & \frac{3}{5} & 0 & 0 \\ 0 & 0 & \frac{1}{10} & \frac{3}{5} & \frac{3}{10} & 0 \\ 0 & 0 & 0 & \frac{3}{10} & \frac{3}{5} & \frac{1}{10} \\ 0 & 0 & 0 & 0 & \frac{3}{5} & \frac{2}{5} \\ 0 & 0 & 0 & 0 & 0 & 1 \end{array}\right) $$
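As a quick standalone check (a Python sketch of the same hypergeometric formula, independent of the Sage code below), the $K=2$ matrix can be rebuilt entry by entry with exact fractions:

```python
from math import comb
from fractions import Fraction

M, K = 5, 2

def p(i, j):
    """Probability of moving from i seen candies to j seen candies."""
    k = j - i  # number of new candies in the draw
    if k < 0 or k > K:
        return Fraction(0)
    return Fraction(comb(M - i, k) * comb(i, K - k), comb(M, K))

# Full (M+1) x (M+1) transition matrix over states 0..M.
P = [[p(i, j) for j in range(M + 1)] for i in range(M + 1)]
for row in P:
    print([str(x) for x in row])
```

The printed rows reproduce the matrix above (e.g. row for state 2 is 0, 0, 1/10, 3/5, 3/10, 0), and each row sums to 1 by the Vandermonde identity.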
Here's the Sage code I used (I keep $K$ as a variable, but the resulting formula for the expected number of transitions before absorption becomes cumbersome, so at that point I plug in a value for $K$):