Approximating a binomial distribution with a binomial distribution with smaller $n$

Given a binomial distribution with parameters $n$ (number of trials) and $p$ (probability of success), is it possible to approximate it with a binomial distribution that has a smaller $n$ by adjusting $p$? [EDIT:] More specifically, the kind of approximation I have in mind is that, e.g., the probability of $k$ successes in an $n=500$ distribution with a specially chosen $p$ is close to the probability of $10k$ successes in an $n=5000$ distribution for a given $p$, or better yet, that the first probability is close to the sum (or average) of the probabilities of $10k-5, 10k-4, \ldots, 10k+4$ successes.
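To make the binned comparison concrete, here is a minimal sketch of how one might measure it. The function names (`binom_pmf`, `binned_pmf`, `total_variation`) and the choice $p=0.3$ are illustrative assumptions, not part of the original setup, and reusing the same $p$ for the small distribution is only a baseline rather than the "specially chosen" value asked about.

```python
import math

def binom_pmf(k, n, p):
    """Binomial pmf, evaluated in log space so large n does not overflow."""
    if p == 0.0:
        return 1.0 if k == 0 else 0.0
    if p == 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def binned_pmf(n_large, p, factor):
    """Sum the large-n pmf over the window factor*k - factor/2 .. factor*k + factor/2 - 1.

    For factor = 10 this is the window 10k-5 .. 10k+4 from the question,
    clipped to the valid range 0 .. n_large at both ends.
    """
    n_small = n_large // factor
    half = factor // 2
    binned = []
    for k in range(n_small + 1):
        lo = max(0, factor * k - half)
        hi = min(n_large, factor * k + half - 1)
        binned.append(sum(binom_pmf(j, n_large, p) for j in range(lo, hi + 1)))
    return binned

def total_variation(a, b):
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))

# Illustrative values only (assumptions, not from the question).
n_large, factor, p = 5000, 10, 0.3
n_small = n_large // factor

target = binned_pmf(n_large, p, factor)                       # binned n=5000 pmf
candidate = [binom_pmf(k, n_small, p) for k in range(n_small + 1)]  # n=500, same p

tv = total_variation(target, candidate)
print(f"total variation distance: {tv:.3f}")
```

A distance near 0 would mean the two mass functions nearly coincide; a distance near 1 would mean they barely overlap. Swapping in a different candidate $p$ for the $n=500$ distribution shows how much, if at all, adjusting $p$ helps.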

What would I like to approximate? Well, the mean and variance, of course, and maybe skew and kurtosis, but what I'd really like is an overall similar probability mass function.
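For reference, these are the standard moments in question, written for $X \sim \mathrm{Bin}(n,p)$, together with a short observation about matching them with fewer trials. The symbols $m$, $q$, and $Y$ are introduced here only for illustration.

$$
\mathbb{E}[X] = np, \qquad
\operatorname{Var}(X) = np(1-p), \qquad
\text{skewness} = \frac{1-2p}{\sqrt{np(1-p)}}, \qquad
\text{excess kurtosis} = \frac{1-6p(1-p)}{np(1-p)}.
$$

In terms of the frequency $X/n$, the mean is $p$ and the variance is $p(1-p)/n$. If $Y \sim \mathrm{Bin}(m,q)$ with $m<n$, then matching the mean frequency requires $q=p$, which leaves

$$
\operatorname{Var}\!\left(\frac{Y}{m}\right) = \frac{p(1-p)}{m} \;>\; \frac{p(1-p)}{n} = \operatorname{Var}\!\left(\frac{X}{n}\right),
$$

so a single binomial with fewer trials cannot match both the mean and the variance of the frequency at once; any choice of $q$ trades one against the other.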

Context: I'm simulating changes in gene frequencies in a population of organisms using a Wright-Fisher model with natural selection, which is to say that I use a binomial distribution with $n$ = population size and $p$ = a value that may depend on fitness differences. I use the distribution to create a transition matrix from states in which the gene has frequency $k$ to states in which it has frequency $k'$. Then, starting (usually) from a state in which all probability is on one frequency, I multiply by the transition matrix as many times as needed to calculate the distributions over frequencies at future times. So what I really want is to approximate the distributions at different times as the multiplication is iterated.
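The procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the haploid selection formula $p_i = (1+s)i / ((1+s)i + (n-i))$, the selection coefficient `s`, and all function names are assumptions made for the example, since the question does not specify how $p$ depends on fitness.

```python
import math

def binom_pmf(k, n, p):
    """Binomial pmf in log space; handles the absorbing cases p = 0 and p = 1."""
    if p == 0.0:
        return 1.0 if k == 0 else 0.0
    if p == 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def wf_transition_matrix(n, s=0.0):
    """Row i holds the distribution of next-generation counts given i copies now.

    ASSUMPTION: haploid selection with relative fitness 1 + s for the focal gene.
    """
    rows = []
    for i in range(n + 1):
        p = (1 + s) * i / ((1 + s) * i + (n - i))
        rows.append([binom_pmf(j, n, p) for j in range(n + 1)])
    return rows

def step(dist, matrix):
    """One generation: row vector of state probabilities times the matrix."""
    size = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(size)) for j in range(size)]

# Illustrative values only (assumptions, not from the question).
n, s, generations = 50, 0.02, 100
matrix = wf_transition_matrix(n, s)

dist = [0.0] * (n + 1)
dist[n // 2] = 1.0            # all probability on one starting frequency

for _ in range(generations):
    dist = step(dist, matrix)

# Probabilities of the two fixation states after the iterations.
print(f"P(k=0) = {dist[0]:.4f}, P(k=n) = {dist[n]:.4f}")
```

Each row of the matrix costs $O(n)$ pmf evaluations, so building it is $O(n^2)$ and each multiplication of a distribution by it is also $O(n^2)$, which is one way to see why the size of $n$ matters for run time.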

One phenomenon I'm interested in is how fast probabilities of fixation (i.e. of $k=0$ or $k=n$) become large, and how the difference between probabilities of these two states changes over time. $n$ has a big effect on this.

The reason that I want to use a smaller $n$ is simply that the calculations I'm doing (not described here) become too slow for $n>500$.

Maybe this can't be done, or must be done differently depending on the size of $n$ and $p$, or must be done differently for different time steps. Maybe approximating the transition matrix will involve using another distribution (Poisson, Gaussian, Beta) or a diffusion approximation? I'm open to all suggestions, but would prefer to end up with a transition matrix (or matrices) that I can multiply.