Is my formula for the CDF of the negative binomial distribution right?

Because Wikipedia, ScienceDirect and Vose Software use different notation for the CDF of the negative binomial distribution, I decided to rewrite it in a way that I can easily understand. We have:

$x$ number of trials, $x = \textrm{1, 2, ...}$

$r$ number of failures, $r = \textrm{1, 2, ... }x$

$k$ number of successes, $k = x - r = \textrm{0, 1, ... }$

$p$ probability of success, $0<p<1$

The formula for the CDF of the negative binomial distribution is:

$$F_{X_k}(x)=P(X_k\leq x)=\sum_{i=1}^{x}{{k+i}\choose{i}}p^{k}{(1-p)}^{i}$$

Do I have it right? Many thanks!

There are 2 best solutions below

I think the best way to think of a Pascal random variable $X_{n}$ with distribution $\mathcal{NB}(p,n)$ is as a sum $Y_{1}+\cdots+Y_{n}$ of $n$ independent random variables, each distributed as a geometric $\mathcal{G}(p)$ variable minus 1.

Indeed, you can see this by taking $Y_{i}$ to be the random variable that counts the failures between the $(i-1)$-th and the $i$-th success. The $Y_{i}$ are clearly independent, and each is distributed as a geometric variable with parameter $p$ minus 1, because the geometric distribution counts the number of trials needed to obtain a success, that is, the number of failures plus the final success.

This makes it easier to verify whether your distribution is correct. You can find some reference here
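The decomposition above can be checked numerically. The following is only a minimal sketch in R, with illustrative values $p = 0.3$ and $n = 2$ that are assumptions rather than anything taken from the question. It relies on the fact that R's rgeom() already counts failures before the first success, so it plays the role of $\mathcal{G}(p)-1$ here.

```r
set.seed(42)            # arbitrary seed, only for reproducibility
p <- 0.3                # probability of success (illustrative value)
n <- 2                  # number of successes to wait for (illustrative value)
n_sim <- 100000         # number of simulated sums

# rgeom() returns the number of failures before the first success,
# so each column is one Y_i distributed as G(p) - 1.
y <- matrix(rgeom(n_sim * n, prob = p), ncol = n)
x_sim <- rowSums(y)     # total failures before the n-th success

# Compare empirical frequencies with the negative binomial PMF on 0..10.
support <- 0:10
empirical <- as.numeric(table(factor(x_sim, levels = support))) / n_sim
theoretical <- dnbinom(support, size = n, prob = p)
max_abs_diff <- max(abs(empirical - theoretical))

print(round(cbind(support, empirical, theoretical), 4))
```

If the decomposition holds, the two columns should agree up to simulation noise.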

As in your question Correct formulas for the mean and variance of negative binomial distribution at CrossValidated, the different sites you link to seem to have different ideas of what $p$ is. The $p$ at ScienceDirect seems to be the probability of failure, whereas the $p$ at Wikipedia is the probability of success. (The latter is intuitively more appealing to me. R also uses the convention that the prob parameter to rnbinom() stands for the probability of success.)

Also as in the CV question and as Henry notes, you need to take care of whether you define your random variable as counting the number of trials or of failures until you see $k$ successes. Or even as the number of trials or of successes until you see $r$ failures.

Your random variable $X_k$ apparently counts the number of failures until you see $k$ successes. This is almost exactly the formulation used in Wikipedia, except that Wikipedia's formulation counts the number of successes until we see a certain number of failures. (Conversely, R does it your way.) We can account for this by exchanging the definition of "success" and "failure", which amounts to changing $p$ in the Wikipedia article to $(1-p)$. (Also note that Wikipedia's $r$ is your $k$.) The PMF is then

$$ P(X_k=i) = {i+k-1 \choose i}p^k (1-p)^i. $$
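
As a quick worked instance of this PMF, assuming the illustrative values $k=2$ and $p=0.3$ (so $1-p=0.7$), the first three probabilities are

$$ P(X_2=0)={1 \choose 0}(0.3)^2=0.09,\qquad P(X_2=1)={2 \choose 1}(0.3)^2(0.7)=0.126,\qquad P(X_2=2)={3 \choose 2}(0.3)^2(0.7)^2=0.1323. $$

Note that the $i=0$ term is not zero: with probability $p^k$ the first $k$ trials are all successes and no failure is seen at all.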

For the CDF, we have to sum from zero (we may see zero failures) to the $x$ you want:

$$ P(X_k\leq x) = \sum_{i=0}^x P(X_k=i) = \sum_{i=0}^x{i+k-1 \choose i}p^k (1-p)^i. $$
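
One way to sanity-check this sum is to compare it with R's built-in pnbinom(), which uses the same "failures before size successes" convention. The sketch below assumes the illustrative values $k=2$ and $p=0.3$; the helper name nb_cdf is made up for this example.

```r
# Hypothetical helper: the CDF written out as the explicit sum from i = 0 to x.
nb_cdf <- function(x, k, p) {
  i <- 0:x
  sum(choose(i + k - 1, i) * p^k * (1 - p)^i)
}

k <- 2      # number of successes (illustrative value)
p <- 0.3    # probability of success (illustrative value)
x <- 0:15   # numbers of failures at which to evaluate the CDF

cdf_formula <- sapply(x, nb_cdf, k = k, p = p)
cdf_builtin <- pnbinom(x, size = k, prob = p)

# TRUE if the explicit sum and the built-in CDF agree up to rounding error.
cdf_matches <- isTRUE(all.equal(cdf_formula, cdf_builtin))
print(cdf_matches)
```

For example, at $x=2$ both give $0.09+0.126+0.1323=0.3483$.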

This looks a little different from your formula, both in the summation (which needs to start from zero, as above) and in the binomial coefficient. I can't quite reconstruct where your version comes from, but a little simulation in R appears to vindicate the CDF I propose (bars are simulation results, the black line gives my CDF, the red line yours):

set.seed(1) # arbitrary seed, so that the simulation is reproducible

pp <- 0.3   # probability of success
kk <- 2 # number of successes

nn <- 10000
failures <- rnbinom(nn,size=kk,prob=pp) # gives number of *failures*, see ?rnbinom

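# CDF proposed in this answer: sum the PMF over 0, ..., xx failures
cdf_sk <- Vectorize(function(xx) {
    ii <- 0:xx
    sum(choose(ii+kk-1,ii)*pp^kk*(1-pp)^ii)}, vectorize.args="xx")

# Formula from the question: sum over 1, ..., xx with coefficient choose(kk+ii,ii)
cdf_muxo <- Vectorize(function(xx) {
    ii <- seq_len(xx) # empty for xx = 0, whereas 1:xx would give c(1, 0)
    sum(choose(kk+ii,ii)*pp^kk*(1-pp)^ii)}, vectorize.args="xx")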

plot(0:max(failures),cumsum(table(factor(failures,levels=0:max(failures))))/nn,
    type="h",xlab="",ylab="",las=1)
lines(0:max(failures), cdf_sk(0:max(failures)))
lines(0:max(failures), cdf_muxo(0:max(failures)),col="red")
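
A further check, independent of the simulation, is to look at what each sum converges to as $x$ grows: a valid CDF has to approach 1. The sketch below reuses $p=0.3$ and $k=2$ and truncates the infinite sums at 500 terms, which is an assumption that the remaining tail is negligible for these values. The closed form $1/p - p^k$ for the question's sum follows from the negative binomial series $\sum_{i\ge 0}{k+i \choose i}(1-p)^i = p^{-(k+1)}$.

```r
p <- 0.3       # probability of success
k <- 2         # number of successes
n_terms <- 500 # truncation point, assumed large enough for these values

# Sum proposed in the answer: starts at i = 0, coefficient choose(i + k - 1, i).
i_all <- 0:n_terms
limit_answer <- sum(choose(i_all + k - 1, i_all) * p^k * (1 - p)^i_all)

# Sum from the question: starts at i = 1, coefficient choose(k + i, i).
i_pos <- 1:n_terms
limit_question <- sum(choose(k + i_pos, i_pos) * p^k * (1 - p)^i_pos)

# Closed form of the question's sum as x goes to infinity.
closed_form_question <- 1 / p - p^k

print(c(answer = limit_answer,
        question = limit_question,
        closed_form = closed_form_question))
```

With these values the first sum tends to 1, while the sum from the question tends to $1/0.3 - 0.3^2 \approx 3.243$, so it cannot be a CDF.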