Expected number of files to find all viruses

122 Views Asked by At

So there are 12 files in total and 3 of them contain viruses. If a file with a virus is selected, it is removed and a new file is then selected. What is the expected number of files that need to be selected to get a virus free computer?

Progress

I thought it was an expected value problem. So summation $\sum X(\xi)p(\xi)$ but that doesn't make sense.

Is this correct? Let n be the number of viruses, then the result is $n(1/1 + .. +1/n)$ $$3(1/1 + 1/2 +... + 1/12) = 9.3096$$

2

There are 2 best solutions below

5
On

Assuming files cannot be chosen more than once, the probability the third virus is in the $n$th file is the probability the two viruses are in the first $n-1$ files and the $n$th file has a virus, which is $$ \frac1{13-n} \times \frac{{3 \choose 2}{9 \choose n-3}}{{12 \choose n-1}} $$ and so the expectation is $$\sum_{n=3}^{12} \frac{n}{13-n} \times \frac{{3 \choose 2}{9 \choose n-3}}{{12 \choose n-1}} $$ which is $9.75$. Since this is $\frac34 \times(12+1)$, there may be an easier way.

Added A more combinatorial approach:

Number the virus-free files $1,2,\ldots,9$.

The probability that virus-free file $i$ has not been chosen by the time all three viruses have been found is $\frac14$.

So the expected number of virus-free files which have not been chosen by the time all three viruses have been found is $\frac94$.

So the expected number of files which have been examined by the time all three viruses have been found is $12-\frac94 = 9.75$.

Generalising this, if there are $c$ clean (virus-free) files, $v$ files with viruses and $f=c+v$ total files, then the expected number is $f-\frac{c}{v+1}=\frac{v}{v+1}(f+1)$.

0
On

Let our random variable (X) be the number of files selected until a virus-free file is selected.

The clue to this problem is the algorithm condition A, which says that "If a file with a virus is selected, it is removed and a new file is then selected."

That is the condition required to select more than 1 file. So 1 file is selected: The probability that it is virus-free is 9/12. Our algorithm stops there (X = 1). What about 2 files? In order to select the second file, the first file must be infected, and the probability of getting an infected file is 3/12 (it is then detected, so our total # of files drops by 1). Then the probability of getting a virus-free file on the second selection is 9/11. So the total probability for X = 2 is 9/44 Three files (X=3): In order to select the third file, two files prior MUST be infected, so the probability of getting an infected file on the first try is again, 3/12 (then it is detected and removed). The probability of getting an infected file on the second selection is 2/11. Then, the probability that the third file is virus-free is 9/10. The total probability for X = 3 is 9/220

X = 4: Prob. of file 1 is infected: 3/12, file 2: 2/11, file 3: 1/10. The fourth file must be virus-free, and look what happens here. Since all the viruses have been found in a row, there are no more viruses left, so the probability of getting a virus-free file is 9/9 or 1. It is certain. The total probability of X=4 occurring is: 1/220

For X = 5 and greater, it becomes an impossibility to select the 5th file because the 4th file must be infected first to fulfill our condition. But there are no more infected files after 4...so the probability of X = 5 is 0. (3/12 * 2/11 * 1/10 * 0/9 * 9/9)

So plugging this into the weighted averages formula gets us: 1*(9/12) + 2(9/44) + 3(9/220 + 4(1/220) + 5(0) + 6(0) + ... 12(0) = 1.3

This is the expected number of files that must be selected in order to get a virus-free file. Think about it, you have files, a majority of which are virus-free. When you run the algorithm, it's going to stop after 1 file and sometimes at 2 files, because the probabilities of selecting 1 or 2 files is much much greater than selecting 3 or 4 files, because it is highly unlikely to select those 3 viruses in a row.