Finding number n bernoulli trials (probability p) needed to satisfy k successes with probability j

294 Views Asked by At

Firstly sorry for the title. I know it's not a good title, but if I knew how to describe this problem in a more elegant way I likely wouldn't have it in the first place.

The problem is the following:

You have a Bernoulli trial with a known p probability of success. You have to figure out the smallest number of trials n such that you have at least a j% chance of obtaining at least k successes (ie; cumulative probability X≥k is greater than j).

j,k, and p are your known variables. You are attempting to find n as the smallest number satisfying the above condition.

Given that the binomial distribution is allergic to closed form solutions I do not expect the problem above to have one, but I'm hopeful I can find a better input estimate for my code, which then uses a stepper function (iterating through higher or lower values of n) to find the true answer. Currently I am using the (very) naive estimate of k/p as a starting value.

Despite lots and lots of googling, I haven't managed to come across this exact question before, mostly stuff relating to the quantile function, which seems to be subtly different.

2

There are 2 best solutions below

6
On

There does not appear to be a good "closed form" solution (depending on what you mean by "closed form"), and you'll need to do some numeric root-seeking at some level.

But you're right that iteration (stepping up and down) is not a great solution. If the software libraries you're using include a function to calculate the inverse CDF of the negative binomial (i.e. matlab has nbinv, python's scipy library has nbinom.ppf), you're done.

Otherwise, you can use bisection (or any other reasonable root-finding or sorted-searching method) on the CDF (which is fairly easy to evaluate) to find the answer you need.

I think any further details are more a software or programming question.

Do note that there are multiple variations and conventions for the negative binomial distribution. Any given function may be evaluating something different than what you expect.

More details

The negative binomial distribution answers "how many (total $n$/non-counting $k$) samples do I have to get before I hit $r$ (your $k$, unfortunately) samples that count. For your purposes, Wikipedia's second (or third, if you want to swap $p$ and $(1-p)$) "alternative formulation" is appropriate.

The minimum number is the first value on the CDF that is greater than or equal to your $j$%. The CDF is the regularized incomplete beta function $I_p(r, k+1) = I_p(r, n + 1 - r)$ evaluated at each $n \geq r$. This is a polynomial in $p$ for integer $r$ and $n$, but this is little help as you're working with $p$ (and $r$) fixed, and $n$ varying. I'm unaware of any good formula to get from the values of this function to $n$, though as previously noted there are certainly canned routines in some software packages.

2
On

Using the normal approximation to the binomial, the probability that $X\ge k$ when $X\sim \text{Bin}(n,p)$ is approximately $$ P(X\ge k)\approx 1-\Phi\left(\frac{k-\frac12-np}{\sqrt{np(1-p)}}\right) $$ Here, $\Phi$ is the cdf of the standard normal variable.

You want to choose $n$ so that $P(X\ge k)\ge j/100$. This is approximately true when $$ np+\Phi^{-1}(1-j/100)\cdot \sqrt{np(1-p)}-(k-\tfrac12)=0 $$ This is a quadratic equation in the variable $\sqrt n$. Let $z=\Phi^{-1}(1-j/100)$. We get $$ n\approx \left(\frac{-z\sqrt{p(1-p)}+\sqrt{z^2p(1-p)+4p(k-\tfrac12)^2}}{2p}\right)^2 $$ I have tested this, and I found that this approximation is very close to the true value of $n$, as long as $p$ is not too close to $0$ or $1$. In particular, this works even in the extreme case where $j\%$ is close to $0\%$ or $100\%$, contrary to Quiaochu Yuan's doubtful comments. :^)

For example, take $p=0.1, j=0.00001$, and $k=34000$. Plugging this into the above, and rounding down, you get $n\approx 331020$. This turns out to be a bit too large, and the actual smallest value of $n$ such that $P(\text{Bin}(n,p)\ge k)\ge j/100$ is $n=330987$. This is a relative error of less than $0.01\%$.


Below is my Python program for finding $n$ exactly. I do not use the approximate of all; it suffices to use bisection search on the binomial cdf.

from scipy.stats import binom

def smallest_n(p, k, j):
    """
    Returns the smallest n such that
    P(X >= k) >= j / 100, where
    X is Binomial(n, p).
    """
    def prob_X_ge_k(n):
        return 1 - binom.cdf(k - 1, n, p)
    return function_bisect(prob_X_ge_k, j/100)

def function_bisect(f, target):
    """
    f is an increasing function from the nonnegative integers to floats.
    target is a float.
    Returns the smallest x such that f(x) >= target.
    WARNING: This loops forever if there is no x for which f(x) >= target.
    """

    lower_lim = 0
    if f(lower_lim) >= target:
        return lower_lim
    
    upper_lim = 2 
    while f(upper_lim) < target:
        upper_lim *= 2

    while upper_lim - lower_lim >= 2:
        mid = (lower_lim + upper_lim) // 2
        if f(mid) >= target:
            upper_lim = mid
        elif f(mid) < target:
            lower_lim = mid
    return upper_lim

p = 0.1
j = 0.00001
k = 34000

print(smallest_n(p, k, j))