How to find the probability of a given number of failing units in a cluster, given the single-unit failure probability


I have a situation at work where we need to release a product. We found an issue that occurs randomly. After a thorough analysis, we were able to find the root cause of the issue. The fix is not under our company's control; it requires the whole world to change. The issue first shows up on a node after 300+ tries; if we continue the reproduction process, I would say it shows up roughly 4 times for every 1000 unit test runs on a single node.

So the probability of failure per unit test run on a single node is 4/1000 = 0.004.

The nodes are in a cluster, and the unit test reproduces the issue during cluster operations. I want to know the probability of the issue occurring if there are 10 nodes in the cluster, and the probability of the issue being seen 1, 2, 3, 4, 5, 6, 7, or 8 times during the reproduction step on a 10-node cluster.

We also want to know the probability of seeing this issue for various cluster sizes.
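To make this concrete, here is a minimal sketch of the quantity I have in mind. It assumes that each node fails independently with probability 0.004 and that each node performs one run; the function name `prob_at_least_one` and the list of cluster sizes are purely illustrative, not real deployment data.

```python
def prob_at_least_one(n_nodes, p=0.004):
    """P(at least one failure among n_nodes nodes).

    Assumes nodes fail independently, each with probability p.
    """
    # Complement of "no node fails": 1 - (1 - p)^n
    return 1.0 - (1.0 - p) ** n_nodes


# Illustrative cluster sizes only (an assumption, not measured data).
for n in (1, 5, 10, 50, 100):
    print(f"{n:>4} nodes: P(at least one failure) = {prob_at_least_one(n):.4f}")
```

If failures on different nodes are correlated (for example, triggered by the same cluster operation), the independence assumption does not hold and these numbers would not apply.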

Is using the binomial distribution the right way to find the probabilities?

$$ p(x|n,p) = \left(\frac{n!}{(x!(n-x)!)}\right) \cdot {p^x} \cdot (1-p)^{n-x} $$

In a 10-node cluster, the probability of seeing exactly one failure comes out to roughly 4%:

$$ p(1|10, 0.004)=\left(\frac{10!}{1!(10-1)!}\right)\cdot{0.004^{1}}\cdot{(1-0.004)^{10-1}} \approx 0.0386 $$
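The same calculation can be sketched in Python for every failure count, not just one. This assumes the binomial model is appropriate, i.e. the 10 nodes fail independently with the same probability 0.004; `binom_pmf` is a helper name I made up for illustration.

```python
from math import comb


def binom_pmf(x, n, p):
    """Binomial probability of exactly x failures in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.004  # 10-node cluster, per-node failure probability from above

# Probability of seeing exactly x failures, for x = 0..8
for x in range(0, 9):
    print(f"P(X = {x}) = {binom_pmf(x, n, p):.3e}")

# Probability of seeing the issue at least once in the cluster
print(f"P(X >= 1) = {1 - binom_pmf(0, n, p):.4f}")
```

Changing `n` gives the corresponding probabilities for other cluster sizes under the same assumptions.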

Is this the right approach?

This will help us get some data to estimate how many customers will see the issue if this product is released.