Simulating probability based on percentage occurrence


I have a question similar to this one.

If, for a group of 85 riders, 40 riders fall at least once over 24 races, then the probability $p$ of a rider falling in any single race satisfies:

$$ 1 - (1 - p)^{24} = \frac{40}{85} $$

which gives $ p \approx 0.0262 $.
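For reference, solving that equation numerically (a quick check, not part of the original post):

```python
# Solve 1 - (1 - p)**24 = 40/85 for p by rearranging:
# (1 - p)**24 = 45/85  =>  p = 1 - (45/85)**(1/24)
p = 1 - (45 / 85) ** (1 / 24)
print(round(p, 4))  # ≈ 0.0262
```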

However, using this probability, if I run 24 simulations (races) of 85 binomial draws (riders) with $p$ probability (replicated 10k times), a fall occurs, on average, $ \approx 53 $ times, not the expected $ \approx 40 $ times.

However, linking back to this question and using its example parameters (98 riders, 4 of whom fall over 52 races): if I run 52 simulations (races) of 98 binomial draws (riders) with the fitted probability from their example ($p \approx 0.0008$), replicated 10k times, a fall does occur, on average, $ \approx 4 $ times, as expected.

I'm really struggling to understand what about my interpretation of $p$ is creating this issue with the simulation.

Thanks

R code for simulations:

> mean(replicate(10000, sum(replicate(24, rbinom(85,1,0.0262)))))
[1] 53.4684

> mean(replicate(10000, sum(replicate(52, rbinom(98,1,0.0008)))))
[1] 4.0213
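For reference, a Python translation of the same two simulations (a sketch using numpy; seed and replication count are arbitrary) gives matching averages:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_falls(n_races, n_riders, p, reps=10_000):
    # Count every fall: one Bernoulli(p) draw per rider per race,
    # summed over all riders and races, averaged over replications.
    draws = rng.binomial(1, p, size=(reps, n_races, n_riders))
    return draws.sum(axis=(1, 2)).mean()

m1 = mean_falls(24, 85, 0.0262)  # ≈ 53.4, i.e. 24 * 85 * 0.0262
m2 = mean_falls(52, 98, 0.0008)  # ≈ 4.1,  i.e. 52 * 98 * 0.0008
print(m1, m2)
```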


BEST ANSWER

The reason you are not getting consistent results is that your simulation does not model what the parameter estimate $p$ actually represents. The original question is this:

Given a cohort of $n$ riders, each of whom ride some fixed number of races $r$, and each rider has an individual, independent, and identical probability $p$ of falling in any given single race, what is the number of distinct riders that fall over $r$ races? Conversely, given that $X \in \{0, 1, \ldots, n\}$ riders are observed to have fallen at least once over $r$ races, what is the estimate of the probability $p$ of a given rider falling in any single race?

As such, your simulation does not model this situation because it counts falls, not riders: if a single rider falls multiple times over $r$ races, each of those falls is counted separately.

So the way you would properly simulate this scenario is to generate $n$ iid geometric random variables $G_1, \ldots, G_n$ with parameter $p$. $G_i$ represents the random number of races that rider $i$ will participate in until they experience their first fall. Under this model, the appropriate parametrization of $G_i$ is $$\Pr[G_i = g] = p (1-p)^{g-1}, \quad g \in \{1, 2, 3, \ldots \}.$$ Then count the number of $G_i$ that are less than or equal to $r$, meaning they fell at least once over $r$ races. This is your $X$, and you would simulate many observations of $X$. The quantity $X/n$ gives an estimated proportion of riders falling over $r$ races.

In your case, you have $n = 85$, $r = 24$, and you observed $X = 40$. Then your estimate of $p$ is $$\hat p = 1 - (1 - X/n)^{1/r} = 1 - (9/17)^{1/24} \approx 0.0261515.$$ When I simulate this in Mathematica using the code

Histogram[ParallelTable[Length[Select[
   1 + RandomVariate[GeometricDistribution[0.0261515], 85], # <= 24 &]], {10^5}]]

I get the following:

[Histogram of the simulated values of $X$, centered near 40]
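The same check can be sketched in Python (assuming numpy, whose `geometric` already uses the support $\{1, 2, \ldots\}$ parametrization above, so no $+1$ shift is needed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 85, 24, 0.0261515

def simulate_X(reps=100_000):
    # G[k, i] = race in which rider i has their first fall (geometric, support 1, 2, ...)
    G = rng.geometric(p, size=(reps, n))
    # X = number of riders whose first fall occurs within the first r races
    return (G <= r).sum(axis=1)

X = simulate_X()
print(X.mean())  # ≈ 40, since n * (1 - (1 - p)**r) = 85 * (40/85) = 40
```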

ANSWER

Let $X_{i,j}$ denote the indicator of rider $i$ falling in race $j$, and assume the $X_{i,j}$ are i.i.d. across $i$ and $j$. You want to estimate $p=\mathsf{E}X_{i,j}$ using $S:=\sum_{i}1\{\sum_jX_{i,j}>0\}$, the number of riders who fall at least once. First, $$ \mathsf{E}[S]=\sum_i\mathsf{P}\!\left(\sum_{j}X_{i,j}>0\right)=n(1-(1-p)^m), $$ where $n$ is the total number of riders and $m$ is the total number of races. A natural estimator of $p$ in this case is $$ \hat{p}=1-(1-\hat{\mathsf{E}}[S]/n)^{1/m}. $$ The precision of $\hat{p}$ is therefore directly tied to the precision of $\hat{\mathsf{E}}[S]$. If you observe only one realization of $S$, as in your case ($S = 40$), the resulting estimate of $p$ can be very far from its actual value. For example, you get $\hat{p}=0$ or $1$ (when $S=0$ or $S=n$) with positive probability.

If you repeat the same experiment many times, say $N$, and use $$ \hat{\mathsf{E}}[S]=\frac{1}{N}\sum_{k=1}^N S_k, $$ where $S_k$ is the result of the $k$-th experiment, you get more and more precise estimates as $N$ increases. Specifically, by the law of large numbers $\hat{p}\to p$ a.s. as $N\to\infty$.
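That convergence can be sketched as follows (a numpy sketch; the true $p$ and the question's $n = 85$, $m = 24$ are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p_true = 85, 24, 0.0262

def estimate_p(N):
    # S_k = number of riders with at least one fall over m races, in experiment k
    falls = rng.binomial(1, p_true, size=(N, n, m))
    S = (falls.sum(axis=2) > 0).sum(axis=1)
    S_hat = S.mean()                        # estimate of E[S] from N experiments
    return 1 - (1 - S_hat / n) ** (1 / m)   # plug-in estimator of p

for N in (1, 100, 10_000):
    print(N, round(estimate_p(N), 4))  # approaches p_true as N grows
```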