A story currently in the U.S. news is that an organization has (in)conveniently had several specific hard disk drives fail within the same short period of time. The question is: how likely is that? I would imagine that it can be shown quantitatively to be very unlikely, and I would like to know whether a simple analysis is sufficient to reach that conclusion, or whether it is necessarily more complicated.
We can start with some assumptions, all of which can be challenged.
- A hard drive fails with an exponential probability density function (pdf) $p(t)= \lambda e^{- \lambda t}$, where $\lambda$ is the reciprocal of the MTBF (mean time between failures).
- All hard drives have an MTBF of 500,000 hours and operate under typical conditions.
- The hard drive failures are independent. (They run on different computers. There is no systematic relationship that would result in a dependency between these failures and any other commonly shared event or condition.)
- The failures are physical (not systematic, such as software induced).
- The organization operates 100,000 hard drives like these simultaneously.
While investigators argue about who has the better credentials relevant to the question, I believe it's amenable to a straightforward analysis, as follows:
An upper bound on $P_{1}(T)$, the probability that a single drive fails within a time $T$, can be calculated by integrating the pdf over the interval $[0, T]$, where the pdf is largest. The cumulative distribution function $$P(t) = 1-e^{- \lambda t}$$ evaluated at $t=T$ gives $P_{1}(T)$.
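Spelling out that integration, as a sketch under the exponential assumption above:

$$P_{1}(T)=\int_{0}^{T} \lambda e^{- \lambda t}\,dt = 1-e^{- \lambda T} \le \lambda T$$

The last inequality follows from $e^{-x} \ge 1-x$. When $\lambda T \ll 1$, as it is here, $P_{1}(T) \approx \lambda T$ is a handy quick estimate.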
By the independence assumption, the probability $P_{N}$ of N specific hard drives all failing in that time interval is $P_{N}(T)=(P_{1}(T))^N$.
The facts in the actual investigation are not totally clear, but it appears that we are talking about 6 specific hard drives failing within one week (168 hours). This leads to $$P_{1}(168)=1-e^{- 168/500,000}=3.36 \times 10^{-4}$$ and $$P_{6}(168)=1.44 \times 10^{-21}$$
This is so incredibly unlikely that I would try modifying my assumptions. First, if the time interval is 13 weeks, then $P_{6}(13*168)=6.85 \times 10^{-15}$. Still incredibly unlikely.
Even reducing the MTBF to 10,000 hours leaves us with $P_{6}(13*168)=5.7 \times 10^{-5}$, or roughly a 1 in 17,500 chance.
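For anyone who wants to check these figures or try other assumptions, here is a minimal Python sketch of the calculation. The function names `p_single` and `p_all_specific` are my own, and the code assumes the exponential model and independence stated above, nothing more.

```python
import math

def p_single(hours, mtbf_hours):
    """P(one drive fails within `hours`) under the exponential model."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny
    return -math.expm1(-hours / mtbf_hours)

def p_all_specific(n, hours, mtbf_hours):
    """P(n specific, independent drives all fail within `hours`)."""
    return p_single(hours, mtbf_hours) ** n

WEEK = 168  # hours

# The three scenarios considered in the text
scenarios = [
    ("1 week, MTBF 500,000 h", WEEK, 500_000),
    ("13 weeks, MTBF 500,000 h", 13 * WEEK, 500_000),
    ("13 weeks, MTBF 10,000 h", 13 * WEEK, 10_000),
]

for label, hours, mtbf in scenarios:
    p6 = p_all_specific(6, hours, mtbf)
    print(f"{label}: P_6 = {p6:.2e} (about 1 in {1 / p6:,.0f})")
```

Changing the entries in `scenarios` makes it easy to see how sensitive the result is to the MTBF and the time window.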
One assumption that I didn't use was assumption number 5, that there are 100,000 hard drives within the organization. This is where the lies, damn lies and statistics creep in. But I think it's safe to say that this is irrelevant, given the other assumptions and that we are talking about specific hard drives.
Based on this analysis, the probability that N specific hard drives would fail in a given interval of time can be easily calculated. Have I made a mistake? Are there other factors that would have a significant effect on the result? If so, how?
Putting rough numbers in, with an MTBF of $500,000$ hours, the chance of a given drive failing in a week is about $\frac {168}{500,000}$. The average number that fail in a week is then $\frac {168\cdot 100,000}{500,000} \approx 34$. Presumably they have some backup process, which failed here, or we wouldn't hear about it. The issue is how many combinations of $6$ there are that will cause data loss. We would need to know how the backup system works.
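To make the contrast between "6 specific drives" and "any 6 drives" concrete, here is a rough Python sketch using the same assumed numbers (100,000 drives, 500,000 hour MTBF, independent failures). The Poisson approximation for the weekly failure count is my own simplification, and the variable names are purely illustrative.

```python
import math

N_DRIVES = 100_000
MTBF_HOURS = 500_000
WEEK_HOURS = 168

# Probability that one given drive fails within a week (exponential model)
p_week = -math.expm1(-WEEK_HOURS / MTBF_HOURS)

# Expected number of failures per week across the whole fleet
expected = N_DRIVES * p_week

# Poisson approximation: probability that fewer than 6 drives fail in a week
p_fewer_than_6 = sum(
    math.exp(-expected) * expected**k / math.factorial(k) for k in range(6)
)
p_at_least_6 = 1 - p_fewer_than_6

# Number of distinct 6-drive subsets in the fleet
subsets_of_6 = math.comb(N_DRIVES, 6)

print(f"expected failures per week: {expected:.1f}")
print(f"P(at least 6 failures somewhere in the fleet): {p_at_least_6:.6f}")
print(f"number of 6-drive subsets: {subsets_of_6:.3e}")
```

Under these assumptions, some 6 drives failing in a given week is essentially certain. What is rare is a particular set of 6 failing together, which is why it matters which combinations actually cause data loss.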