Problem Statement
I'm interested in the mean number of samples from a given uniform distribution of integers I would need to take to reach a desired total or more.
For example, a six-sided dice is a uniform distribution of integers from $1$ through to $6$. The number of times you have to roll a six-sided dice ($r$) to achieve a desired total or greater ($t$) produces a distribution. For example, here is a graph produced by sampling dice rolls, aiming for a total of $80$ per run:
Graph of six-sided dice rolls with a target of $80$
The mean of this distribution is approximately $23.332448$. This is the number I am interested in deriving. Simply 23 would be too imprecise for what I need. Right now I am determining this by simply sampling large numbers of dice rolls and then calculating the average of that data set, however, I would like to instead understand the statistics involved and compute the mean, or a reasonable approximation of the mean.
One might try to derive the mean of the number of rolls needed $P(r=t)=0.5$ by simply using the mean of the uniform distribution, and dividing the total $t$ by that mean:
$$\frac{t}{(max-min)/2+min} = \frac{80}{(6-1)/2+1} \approx 22.857$$
I will call this the 'naive' approximation. As you can see, this result can be significantly off the desired result. Sometimes, it happens to be close to the right answer, however, I am interested in many cases where it is unacceptably imprecise.
Ultimately, I want to have a method or formula that takes an arbitrary continuous uniform distribution of integers (such as a die) and the desired total, which tells me the mean number of samples from that distribution I would need to take in order to reach that total.
Prior Research
I've looked at a few posts that talk about this subject, however, none of them were able to provide the complete answer to me. Here are two of the things I've found and read:
A Similar Question on This Forum
Number of dice rolls taken to reach a certain sum
This was useful in creating an approximation of the actual distribution I'm trying to find. I used this to turn the following recursive function into the following code (in rust, dependency on the cached crate for memoizing the recursive results to improve performance to something usable):
$$P(S_t=s) = \sum_{i=1}^6P(S_{t-1} = s-i)/6$$
With the following terminators:
- $P(S_0=0) = 1$
- $P(S_0<0) = P(S_{t>0}=0) = 0$
#[cached]
pub fn p(t: i32, s: i32) -> f64 {
if s <= 0 && t <= 0 {
return 1.0;
}
if s <= 0 || t <= 0 {
return 0.0;
}
(1..(std::cmp::min(6, s) + 1)).map(|i| p(t - 1, s - i)).sum::<f64>() / 6.0
}
Which, when run, can be used to create a graph comparing the frequency of the sampled results to the frequency predicted by that algorithm. In the graph, the blue line is the result of the algorithm, while the orange line is sampled data from a simple PRNG using a uniform distribution (the same data as in the first graph).
Note that to get this graph, you have to divide the results of this algorithm by the inverse of the mean of the uniform distribution (so $3.5^{-1}$ in this example) The author of the original answer suggests dividing by $P(S_t=36)$ (where $36$ was their target) however fails to realise that this is just an approximation of the inverse of the mean. I'm not actually sure why the mean of the uniform pops out here, so that's an interesting tidbit.
This looks pretty good, however, it does not provide the mean number of rolls as best I can tell. The author of the answer to the linked question states that the mean in their example is "about $10.5238577$", however, they don't state how they derived that number and I haven't been able to figure it out.
It's also not generalised to any uniform, however, I don't believe it would be difficult to generalise. I don't anticipate this to be an obstacle if someone finds out how to derive the mean from that answer.
A Similar Question on Reddit
On average, how many rolls of a fair die would it take to get a cumulative score above n?
This was useful in creating a different approach to solving this problem, where I iterated $r$ (the number of rolls) upwards from 1, each time producing an approximation of the distribution of the combined rolls using a normal distribution. The first normal distribution to have a cumulative distribution function for the given target of less than $0.5$ indicated when the integer mean number of rolls was being made. Again, however, like the previous solution, this only produced the integer number of rolls. I need to know the mean number of rolls more precisely than this.
There are a number of other answers than the top up-voted answer on this Reddit post with differing amounts of clear information. The answer from the user u/possiblywrong shows an interesting cumulative distribution function which I wonder if it would solve my problem. However, I am unable to figure out how to reproduce their ideas as code or fully understand them. For completeness sake, here is an image of that response in case it is deleted in the future.
Thanks
Thank you for any time you spend on this, this is doing my head in. It feels like it should be so easy to model, and yet I've lost an entire day researching it only to find vague answers that are either too complicated for me, or are simply missing information that I need to understand them. Hopefully, you can help.
Let $f_{s}(t)$ be the expected number of rolls to get a sum greater than or equal to $t$ with an $s$-sided die. Then you have the recurrence $$f_s(t) = 1+\frac{1}{s}\sum_{k=1}^s f_s(t-k)$$ with the base cases of $f_s(t) = 0$ with $t \le 0$.
The solution to this (ignoring the base cases) $$f_s(t) = \frac{2}{1+s}t+\sum_{n=1}^{s}c_nr_n^t$$
where $r_n$ goes over the distinct roots of $sr^{s+1}-(s+1)r^s+1$. Alternately, since $r=1$ is a double root, this becomes $$f_s(t) = \frac{2}{1+s}t+c_1+\sum_{n=2}^{s}c_nr_n^t$$
where $r_n$ instead goes over the roots of $$\sum_{m=0}^{s-1}(m+1)r^m = 1+2r+3r^2+...+sr^{s-1} \tag 1$$
To find the exact values of $c_1, ..., c_n$, you would need to solve the equations $f_s(t) = 0$ for $t = 1-s, 2-s, ..., 0$.
Alternately, you could just build up from $t = 0$ (using a computer) to find that $$f_6(80) = \frac{694905206512232068400374286878415315634626408538515111020013047}{29781651707669509088572079548239633038047628833600290523447296} \approx 23.333$$
If you don't need the exact value, since the magnitude of each root of $(1)$ is less than $1$, an approximation would be $f_s(t) \approx \frac{2}{1+s}t+c_1 = \frac{2}{1+s}t+O(1)$