How many times do you have to download a 100,000 word document if a random 5% of the document is deleted every time you download it?

56 Views Asked by At

If there is an online document with 100,000 completely unique words and every time you download it, 5% of the words are randomly deleted, how many times do you have to download it before you get all 100,000 words?

1

There are 1 best solutions below

2
On

Take a look at the first word. There is a 5% it is deleted the first time. The probability it is deleted on both the first and second download is $0.05^2$. The probability it is deleted on all downloads after $N$ downloads of the document is $0.05^N$. That is never equal to 0 no matter how big $N$ is. Thus, you cannot ever be certain that even the first word of the document is downloaded, let alone all the words in the document.

There is a formula in the answer to this question.

I implemented it two different ways in Mathematica. The first way is literally from the formula as written. The second way is an attempt to make it more numerically stable. I assume $n$ is large and$m$ is a fraction (between 0 and 1) of $n$ so that $m=f \times n$. Then, I take the log of the terms with the binomial coefficients because those numbers are huge in this case. Do the calculations on the log-scale and then exponentiate back to get the correct term. I check that both give almost the same answer in the case of that question where $m$ and $n$ are relatively small ($n=100$ and $m=10$, which means $f=0.1$). The first function won't even run with large numbers like this problem, i.e. $n=100000$.

P1[t_, n_, m_] := 
 Sum[(-1)^j Binomial[n, j] (Binomial[n - j, m]/Binomial[n, m])^t, {j, 
   0, n}]

P2[t_, n_, f_] := N[1 + Sum[(-1)^j 
     Exp[(j t Log[1 - f] + j Log[n] - LogGamma[1 + j]) + (-j + f j + 
        j^2 - f j^2 - f j t + f j^2 t)/(2 (-1 + f) n)], {j, 1, n}]]

Next, I found that for your question P2[8,100000,0.95] is 0.999996
P2[5,100000,0.95] is 0.969233
P2[4,100000,0.95] is 0.535181
P2[3,100000,0.95] is 0.00000356

That is, you will almost never download the whole document after 3 download attempts, there is a 53.5% of success after 4 download attempts, 96.9% chance after 5 download attempts.

Intuition: on average, each download captures 95% of the the missing words. So, after two downloads, you have only about 250 missing words. After 3 downloads, about 12 missing words. The probability that you capture all those 12 words in the fourth download is $0.95^{12} \approx 0.54$. If you don't capture all of them, you will most likely only have 1 or 2 missing now and you are practically guaranteed to catch them in the next one or two downloads.