Let $X \sim \operatorname{Bin}(n, p)$ and call $X_{1}, \ldots, X_{i}$ the indices of the successes, with $X_{1}$ being the index of the first success.
I would like to compute the expected value of $X_{i} - X_{i-1}$, that is, the average number of trials between successes.
While trying to do this, I found that this is very similar to a negative binomial distribution, or to a sum of geometric random variables. However, if I understand correctly, the difference is that here the number of trials is fixed. When $n$ is large, the value equals the expected value of a geometric distribution, but when $n$ is small (for instance $n = 20$), the value is lower.
To make things more concrete, I built a simple simulation in Python (the example used is a sequence of biased coin tosses).
If $n$ is set to a large number, then $E(X_{i} - X_{i-1}) = 1/p$, which is the expected value of the geometric distribution. However, if $n$ is set to a small number, such as 20, then $E(X_{i} - X_{i-1})$ is lower. For instance, with $p = 0.2$ and $n = 20$ I find an average value around 4 (instead of 5). Yet the average number of successes is still equal to $np$ (in my example, 4).
I re-coded my simulation to take Jeremy's answer into account, and I also added his expected-value computation (`expected_interval` below).
    import numpy as np

    def experiment(n_tosses, p):
        # Toss a biased coin n_tosses times; record the gaps between successes.
        diffs = []
        success_indices = []
        for i in range(n_tosses):
            toss = np.random.binomial(1, p)
            if toss == 1:
                success_indices.append(i)
        if len(success_indices) > 1:
            for i in range(1, len(success_indices)):
                index = success_indices[i]
                last_index = success_indices[i - 1]
                diff = index - last_index
                diffs.append(diff)
            return {"interval": np.mean(diffs),
                    "coverage": sum(diffs),
                    "successes": len(success_indices)}
        elif len(success_indices) == 1:
            return {"interval": 0, "coverage": 1, "successes": 1}
        else:
            return {"interval": 0, "coverage": 0, "successes": 0}
    def conduct_experiments(n_tosses, p, excl_zeros=False, n_exp=1000):
        experiments = [experiment(n_tosses, p=p) for _ in range(n_exp)]
        successes = []
        intervals = []
        coverages = []
        for e in experiments:
            if not excl_zeros or e["successes"] > 1:
                successes.append(e["successes"])
                intervals.append(e["interval"])
                coverages.append(e["coverage"])
        print("interval", np.mean(intervals))
        print("successes", np.mean(successes))
        print("coverage", np.mean(coverages))
        return coverages
    from scipy.special import comb

    def expected_interval(n_tosses, p):
        # Jeremy's closed-form expected value of the mean interval.
        n = n_tosses
        sum_successes = 0
        for i in range(2, n + 1):
            probs = p**i * (1 - p)**(n - i)
            sum_poss = 0
            for k in range(i - 1, n):
                sum_poss += (n - k) * comb(k, i - 1)
            sum_successes += probs * sum_poss
        return sum_successes
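As a sanity check (the helper `brute_force_EY` and the small parameters below are my own additions, not part of the original code), `expected_interval` can be compared against an exhaustive enumeration of all $2^n$ outcomes for a small $n$:

```python
from itertools import product
from scipy.special import comb

def expected_interval(n_tosses, p):
    # Copy of the function above, repeated so this snippet is self-contained.
    n = n_tosses
    sum_successes = 0
    for i in range(2, n + 1):
        probs = p**i * (1 - p)**(n - i)
        sum_poss = 0
        for k in range(i - 1, n):
            sum_poss += (n - k) * comb(k, i - 1)
        sum_successes += probs * sum_poss
    return sum_successes

def brute_force_EY(n, p):
    # E(Y) computed directly from the definition: Y is the mean gap
    # between successes, and Y = 0 when there are fewer than 2 successes.
    total = 0.0
    for outcome in product([0, 1], repeat=n):
        idx = [j for j, t in enumerate(outcome) if t == 1]
        i = len(idx)
        prob = p**i * (1 - p)**(n - i)
        if i >= 2:
            total += prob * (idx[-1] - idx[0]) / (i - 1)
    return total

print(expected_interval(8, 0.3), brute_force_EY(8, 0.3))
```

The two values agree to floating-point precision, which suggests the formula implementation matches the definition of the statistic.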
Looking at your code, you have biased your experiment against long strings of failures. For example, if you have a string of 20 failures (which, given your parameters, occurs slightly more than 1% of the time), you discard the trial. If you have a string with 1 success and 19 failures (which occurs about 5.8% of the time), you record the trial but assign a mean interval of 0.
To provide a fairer basis for experiment, you should probably run your trials until a certain number of successes is reached, regardless of the number of tosses required.
UPDATE: Based on OP's comment, I wish to update my answer:
I believe the reason that the expected value of the statistic is significantly lower for a small number of trials $n$ than for a large number is an edge effect when there are 0 or 1 successes in the string of 20. With the parameters given ($n=20$, $p=0.2$), approximately 7% of the experiment repetitions fall into this case. If the experiment were run to a fixed number of successes, such a string of failures would create a long interval. However, since the experiment is run to a fixed number of trials, this string of failures contributes 0 to the mean.
Obviously as $n$ increases, these effects become diminished in the average and the expectation approaches what is expected.
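As a quick sketch of where the ~7% figure comes from, the probability of the edge case (0 or 1 successes) can be computed directly with the parameters above:

```python
# Probability of 0 or 1 successes in n = 20 trials with p = 0.2,
# i.e. the edge cases that contribute 0 to the mean interval.
n, p = 20, 0.2
p_edge = (1 - p)**n + n * p * (1 - p)**(n - 1)
print(p_edge)  # about 0.069, i.e. roughly 7% of repetitions
```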
UPDATE: Now that I understand better, I think it is possible to calculate the expected value of the statistic $Y$, the mean of the interval times between successes in $n$ binomial trials with a probability $p$ of success. As above, let $i$ be the number of successes in any given experiment. From previous discussion, if $i=0,1$, we define $Y=0$.
Suppose $i \ge 2$. Then the mean interval time for any given experiment can be written as $$Y=\frac1{i-1}\sum_{j=1}^{i-1} (X_{j+1}-X_j) = \frac{1}{i-1}(X_i-X_1)$$ as the sum telescopes. So for any given experiment, we can calculate our statistic $Y$ knowing the number of successes $i$ and the time interval between the first and last success, which we define to be $k$. Note that $k$ necessarily satisfies $i-1 \le k \le n-1$.
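The telescoping step can be illustrated with a concrete (made-up) list of success indices: the mean of the consecutive gaps is exactly $(X_i - X_1)/(i-1)$.

```python
# Illustration: the mean of consecutive gaps telescopes to
# (last index - first index) / (number of successes - 1).
idx = [2, 5, 6, 11]  # hypothetical success indices
gaps = [b - a for a, b in zip(idx, idx[1:])]  # [3, 1, 5]
mean_gap = sum(gaps) / len(gaps)
print(mean_gap, (idx[-1] - idx[0]) / (len(idx) - 1))  # both 3.0
```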
For a fixed $i \ge 2$ and $k$, there are $$(n-k) {{k-1}\choose{i-2}}$$ outcomes of the experiment with $i$ successes and first-to-last interval $k$. To see this, note that once $k$ is chosen, there are $n-k$ possibilities for the pair $(X_1,X_i)$, namely $(1,k+1) \ldots (n-k,n)$. Once this interval is chosen, the remaining $i-2$ successes must be chosen from the $k-1$ trials strictly between $X_1$ and $X_i$, which can be done in ${{k-1}\choose{i-2}}$ ways.
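This counting argument can be verified by brute force for small parameters (the values $n=8$, $i=3$, $k=4$ below are arbitrary choices of mine):

```python
# Brute-force check of the counting argument: the number of length-n
# outcomes with exactly i successes whose first and last successes are
# k apart should equal (n - k) * C(k - 1, i - 2).
from itertools import product
from math import comb

n, i_target, k_target = 8, 3, 4
count = 0
for outcome in product([0, 1], repeat=n):
    idx = [j for j, t in enumerate(outcome) if t == 1]
    if len(idx) == i_target and idx[-1] - idx[0] == k_target:
        count += 1

formula = (n - k_target) * comb(k_target - 1, i_target - 2)
print(count, formula)  # both 12
```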
For this fixed $i$ and $k$, each of these outcomes has probability $p^i(1-p)^{n-i}$ of occurring, and the statistic $Y$ has value $\frac{k}{i-1}$. Thus we can sum over the possible values of $i$ and $k$ to get our expected value
$$E(Y) = \sum_{i=2}^n \sum_{k=i-1}^{n-1} (n-k) {{k-1}\choose{i-2}} \frac{k}{i-1} p^i(1-p)^{n-i}$$
which has the easy simplification
$$E(Y) = \sum_{i=2}^n p^i(1-p)^{n-i} \sum_{k=i-1}^{n-1} (n-k) {{k}\choose{i-1}}$$
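The simplification uses the identity $\binom{k-1}{i-2}\frac{k}{i-1} = \binom{k}{i-1}$, and one can confirm numerically that the two double sums agree (a small sketch with arbitrary parameters):

```python
# Numerical check that the simplified sum equals the original double sum.
from math import comb

def EY_original(n, p):
    return sum((n - k) * comb(k - 1, i - 2) * (k / (i - 1))
               * p**i * (1 - p)**(n - i)
               for i in range(2, n + 1) for k in range(i - 1, n))

def EY_simplified(n, p):
    return sum(p**i * (1 - p)**(n - i)
               * sum((n - k) * comb(k, i - 1) for k in range(i - 1, n))
               for i in range(2, n + 1))

print(EY_original(10, 0.3), EY_simplified(10, 0.3))
```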
This can probably be simplified a little bit further, but I'm kinda tired :-)
UPDATE: One can also conduct this experiment by discarding results with less than two successful trials, rather than assigning $Y=0$ in those cases, creating a new random variable $Z$. To calculate the resulting expectation, the analysis is the same as above, except that contributions to the expectation from the 0 and 1 success cases need to be discarded (but they were 0 anyway), and the remaining probabilities need to be conditioned on the event that the number of successes is at least two. The expectation for this modified protocol can be obtained from the value $E(Y)$ via: $$E(Z) = \frac{E(Y)}{1-(1-p)^n-np(1-p)^{n-1}}$$
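To double-check the conditioning formula, $E(Z)$ can also be computed directly by exhaustive enumeration for a small $n$ (a sketch; `enumerate_stats` and the parameters are my own naming):

```python
# Check E(Z) = E(Y) / P(at least 2 successes) by exhaustive enumeration.
from itertools import product

def enumerate_stats(n, p):
    ey = 0.0     # E(Y), with Y = 0 when there are fewer than 2 successes
    p_ge2 = 0.0  # P(number of successes >= 2)
    for outcome in product([0, 1], repeat=n):
        idx = [j for j, t in enumerate(outcome) if t == 1]
        i = len(idx)
        prob = p**i * (1 - p)**(n - i)
        if i >= 2:
            ey += prob * (idx[-1] - idx[0]) / (i - 1)
            p_ge2 += prob
    return ey, p_ge2

n, p = 10, 0.25
ey, p_ge2 = enumerate_stats(n, p)
ez = ey / p_ge2  # conditional expectation, i.e. E(Z)
closed_form = 1 - (1 - p)**n - n * p * (1 - p)**(n - 1)
print(ez, ey / closed_form)  # the two should agree
```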