Coupon collector's derivation


Yesterday, a user posted the following derivation of the coupon collector's problem but unfortunately deleted their question:

There are $n$ distinct characters one can obtain from a card package; each package you buy contains exactly one card, and that card shows each of the $n$ characters with probability $\frac{1}{n}$. Compute the expected number of card packages you need to buy so that you have each of the $n$ distinct characters at least once.
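As a quick sanity check on the statement, here is a minimal Monte Carlo sketch of the process (the function name `packages_needed` and all parameter choices are mine, not from the original post): it buys uniform random cards until all $n$ characters have appeared and averages the counts over many runs.

```python
import random

def packages_needed(n, rng):
    """Buy packages (one uniform card each) until all n distinct
    characters have been seen; return how many were bought."""
    seen = set()
    bought = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # each character with probability 1/n
        bought += 1
    return bought

# Averaging many runs approximates the expectation asked for;
# for n = 10 it should land near 10 * H_10 ≈ 29.29.
rng = random.Random(42)
estimate = sum(packages_needed(10, rng) for _ in range(20000)) / 20000
```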

Let $\Omega$ be our sample space and $X\colon \Omega\to \mathbb{R}$ the random variable that counts the number of card packages we have to buy in order to obtain each of the $n$ distinct characters at least once. Moreover, let $X_i$ be an indicator variable with $$ X_i = \begin{cases} 1, \ \text{if the $i$th character we obtain is distinct from all we have collected so far}\\ 0, \ \text{otherwise} \end{cases} $$ Now we find that $P(X_i) = 1-\frac{i-1}{n} = \frac{n-i+1}{n}$, since the probability that we get one of the $i-1$ characters we already have is $\frac{i-1}{n}$. Moreover, it's clear that $\Omega= X_1 \uplus X_1^c$, where $X_1^c$ denotes the complementary event of $X_1$. Now I used the total expectation theorem to obtain \begin{align*} \mathbb{E}[X] = (\mathbb{E}[X|X_1] + 1)\cdot P(X_1) + (\mathbb{E}[X|X_1^{c}] + 1) \cdot P(X_1^{c}) ,\end{align*} where the $+1$ accounts for the card package we have just bought. We notice that $\mathbb{E}[X|X_1^{c}] = \mathbb{E}[X]$ since we haven't obtained any new card.
With this we obtain \begin{align*} &\mathbb{E}[X] = \mathbb{E}[X|X_1]\cdot P(X_1) + P(X_1) + \mathbb{E}[X]\cdot P(X_1^{c})+ P(X_1^{c}) \\[15pt] \iff &\mathbb{E}[X]\cdot (1-P(X_1^{c})) = \mathbb{E}[X|X_1]\cdot P(X_1) + P(X_1) + P(X_1^{c}) .\end{align*} Since $P(X_1^{c}) = 1-P(X_1)$ this becomes \begin{align*} \mathbb{E}[X] \cdot P(X_1) = \mathbb{E}[X|X_1] \cdot P(X_1) + 1 \iff \mathbb{E}[X] = \mathbb{E}[X|X_1] + \frac{1}{P(X_1)} = \mathbb{E}[X|X_1] + \frac{n}{n} .\end{align*} We can repeat this procedure, since $X_2 \uplus X_2^{c}$ is again a disjoint partition of our sample space, and obtain \begin{align*} \mathbb{E}[X] = \mathbb{E}[X|X_1, X_2] + \frac{n}{n-1} + \frac{n}{n} .\end{align*} I would now argue inductively that \begin{align*} \mathbb{E}[X] = \mathbb{E}[X|X_1, X_2, \ldots, X_{i}] + \sum_{k = 1}^{i} \frac{n}{n-k+1} ,\end{align*} so in the end one has \begin{align*} \mathbb{E}[X] = \sum_{k = 1}^{n} \frac{n}{n-k+1} = n\sum_{k = 1}^{n}\frac{1}{k} \end{align*} since the recursion ends with $\mathbb{E}[X|X_1, \ldots, X_n] = 0$.
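The closed form $n\sum_{k=1}^{n}\frac{1}{k}$ can be checked exactly, independently of the derivation above. A small sketch in exact rational arithmetic (the helper names are mine): it computes $\mathbb{E}[X]$ as the sum of the stage expectations $\frac{n}{n-i}$, the standard expected waiting time for a new character when $i$ are already held, and compares that against $n H_n$.

```python
from fractions import Fraction

def expected_packages(n):
    """E[X] as the sum of stage expectations: holding i distinct
    characters, the next new one takes n/(n-i) draws on average."""
    return sum(Fraction(n, n - i) for i in range(n))

def n_times_harmonic(n):
    """The closed form n * H_n from the derivation."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

# The two expressions agree exactly for every n.
agree = all(expected_packages(n) == n_times_harmonic(n) for n in range(1, 40))
```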

My problem is the $+1$ he has in the application of total expectation. It is intuitively clear to me, but how would one justify it formally, since the total expectation theorem actually reads $$ \mathbb{E}[X] = \mathbb{E}[X|X_1] \cdot P(X_1) + \mathbb{E}[X|X_1^{c}] \cdot (1-P(X_1)) $$ without any $+1$'s? And how does one explain $\mathbb{E}[X] = \mathbb{E}[X|X_1^{c}]$ rigorously?

Edit: For the latter, I think it suffices to say that since $X_1^c$ means we haven't obtained any new card, conditioning on it doesn't change the expected value of $X$.


BEST ANSWER

The $+1$ in the application of total expectation produces a contradiction. If you distribute out the terms in their expression, using the fact that $P(X_i^c)+P(X_i)=1$ and the correct total expectation formula, you get: \begin{align*} \mathbb{E}[X]&=(\mathbb{E}[X|X_1]+1)\cdot P(X_1)+(\mathbb{E}[X|X_1^c]+1)\cdot P(X_1^c) \\ &= \mathbb{E}[X|X_1]\cdot P(X_1)+P(X_1)+\mathbb{E}[X|X_1^c]\cdot P(X_1^c)+P(X_1^c) \\ &=\mathbb{E}[X|X_1]\cdot P(X_1)+\mathbb{E}[X|X_1^c]\cdot P(X_1^c)+P(X_1^c)+P(X_1)\\ &=\mathbb{E}[X]+1 \end{align*} Subtracting $\mathbb{E}[X]$ from both sides leaves $0=1$, a contradiction. This contradiction invalidates the proof, regardless of the correctness of the solution. I believe the contradiction stems from the author attempting to keep track of time using events, which leads to numerous errors like the one above. To keep track of time, you should use a variable or a sequence of variables, not events. Mathematical objects never experience time; they are the same at all stages of a proof, regardless of intent. To correct the proof (and to make the $+1$ rigorous), we can simply repair this timekeeping.
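The overcounting is really an algebraic identity: whatever values the two conditional expectations and the probability take, the expression with the two $+1$'s exceeds the correct total expectation by exactly $1$. A throwaway sketch checking this with random exact rationals (all variable names are mine, standing in for the symbols above):

```python
from fractions import Fraction
import random

rng = random.Random(0)
for _ in range(200):
    a = Fraction(rng.randrange(0, 100), rng.randrange(1, 100))  # stands in for E[X|X_1]
    b = Fraction(rng.randrange(0, 100), rng.randrange(1, 100))  # stands in for E[X|X_1^c]
    p = Fraction(rng.randrange(1, 100), 100)                    # stands in for P(X_1)
    flawed = (a + 1) * p + (b + 1) * (1 - p)
    correct = a * p + b * (1 - p)
    assert flawed == correct + 1  # the +1's always contribute exactly 1
```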

We will measure time in terms of a non-negative integer $j$ indicating how many (not necessarily distinct) cards we have drawn already, and for each $j$, let $N_j$ be the random variable denoting the number of cards that must still be drawn at time $j$ to reach the $n$ distinct cards, so that $N_0=X$ and generally $N_{j-1}=N_j+1$; in fact we simply have $N_j=X-j$. Additionally, for each $i\leq n$ we let $t_i$ be the random variable counting the number of cards drawn by the time we have $i$ distinct cards, with $t_0=0$. We will consider $X_i$ to be the event that the $i$'th card you draw is distinct from the previous $i-1$ cards, so we can also view the $t_i$ as the unique increasing sequence (of random variables) for which $X_{t_i}=1$ for all $i>0$. This improves the timekeeping and corrects their proof method while retaining the original intent.

For convenience, in the following derivations we will let $j=t_{i-1}$ be the number of cards we have drawn after just receiving $i-1$ distinct cards (without overlaps). For the first correction, instead of having $P(X_i)=1-\frac{i-1}{n}$, we should actually have $P(X_{j+1})=1-\frac{i-1}{n}$. This is the case since $j=t_{i-1}$ is the exact time that we have $i-1$ distinct cards, so $X_{j+1}$ is the indicator just after that point in time, and will be satisfied exactly when we do not draw any of the $i-1$ duplicate cards. Now, their total expectation calculation can be corrected (and generalized) like so. \begin{align*} \mathbb{E}[N_j] &= \mathbb{E}[N_j|X_{j+1}]\cdot P(X_{j+1}) + \mathbb{E}[N_j|X_{j+1}^c]\cdot P(X_{j+1}^c) \\ &= (\mathbb{E}[N_{j+1}|X_{j+1}]+1)\cdot P(X_{j+1}) + (\mathbb{E}[N_{j+1}|X_{j+1}^c]+1)\cdot P(X_{j+1}^c) \\ &= \mathbb{E}[N_{j+1}|X_{j+1}]\cdot P(X_{j+1}) + \mathbb{E}[N_{j+1}|X_{j+1}^c]\cdot P(X_{j+1}^c) + 1 \\ \end{align*}

Here we use the explicit timekeeping. On the left we have the expectation written in terms of $N_{j}$, that is, the number of cards we still need to draw after having just drawn $i-1$ distinct cards, and on the right we have written the expectation in terms of $N_{j+1}$, that is, the number of cards we still need to draw after having drawn one more card than that. The $+1$ is made rigorous since, as we noted, $N_{j}=N_{j+1}+1$ by definition. Now we can indeed say that $\mathbb{E}[N_{j+1}|X_{j+1}^c] = \mathbb{E}[N_{j}]$: if we fail to draw a new distinct card at time $j+1$, then the expected number of cards we have yet to draw is identical to the expectation before we drew the $(j+1)$'th card at all. Now, however, we can also say that $\mathbb{E}[N_{j+1}|X_{j+1}]=\mathbb{E}[N_{t_i}]$: since $j+1$ is the time just after having drawn a total of $i-1$ distinct cards, in the event that the $(j+1)$'th card is also distinct from the previous ones, we have $t_i=j+1$; that is, $j+1$ is the time at which we first hold $i$ distinct cards, which is exactly $t_i$. Substituting everything in and solving, we obtain the following. \begin{align*} \mathbb{E}[N_j]&= \mathbb{E}[N_{j+1}|X_{j+1}]\cdot P(X_{j+1}) + \mathbb{E}[N_{j+1}|X_{j+1}^c]\cdot P(X_{j+1}^c) + 1 \\ \iff \mathbb{E}[N_{t_{i-1}}]&= \mathbb{E}[N_{t_i}]\cdot P(X_{j+1}) + \mathbb{E}[N_{t_{i-1}}]\cdot (1-P(X_{j+1})) + 1 \\ \iff \mathbb{E}[N_{t_{i-1}}] &= \mathbb{E}[N_{t_i}] + \frac{1}{P(X_{j+1})} \\ \iff \mathbb{E}[N_{t_{i-1}}] - \mathbb{E}[N_{t_i}] &= \frac{n}{n+1-i} \end{align*}
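Since $N_{t_{i-1}} - N_{t_i} = t_i - t_{i-1}$, the per-stage differences can also be estimated by simulating the stage times directly. A rough Monte Carlo sketch (the function name `stage_times` and the parameters are mine) comparing the simulated gaps against $\frac{n}{n+1-i}$:

```python
import random

def stage_times(n, rng):
    """One run of the process; returns [t_0, ..., t_n], where t_i is
    the number of cards drawn when i distinct characters are first held."""
    t = [0]
    seen, drawn = set(), 0
    while len(seen) < n:
        drawn += 1
        card = rng.randrange(n)
        if card not in seen:
            seen.add(card)
            t.append(drawn)
    return t

rng = random.Random(7)
n, trials = 6, 40000
gap_sums = [0] * n
for _ in range(trials):
    t = stage_times(n, rng)
    for i in range(1, n + 1):
        gap_sums[i - 1] += t[i] - t[i - 1]
estimated = [s / trials for s in gap_sums]              # ~ E[t_i - t_{i-1}]
predicted = [n / (n + 1 - i) for i in range(1, n + 1)]  # n/(n+1-i)
```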

Now, since we have $X=N_0=N_{t_0}$, and also $N_{t_n}=0$ since by that point we already have all $n$ distinct cards, we can add up these differences to calculate $\mathbb{E}[X]$ with a telescoping sum as follows. \begin{align*} X &= N_{t_0} = N_{t_n} + (N_{t_{n-1}}-N_{t_n}) + (N_{t_{n-2}}-N_{t_{n-1}}) + \cdots + (N_{t_0}-N_{t_1}) \\ \mathbb{E}[X] &= 0 + \sum_{i=1}^{n}\mathbb{E}[N_{t_{i-1}}-N_{t_i}] = \sum_{i=1}^{n}\frac{n}{n+1-i} \\ &= n\sum_{i=1}^{n}\frac{1}{i} \end{align*}

This is the correct solution.