Method of generating random numbers that sum to 100 - is this truly random?


I am writing a computer program that involves generating 4 random numbers, a, b, c, and d, the sum of which should equal 100.

Here is the method I first came up with to achieve that goal, in pseudocode:

Generate a random number out of 100. (Let's say it generates 16).
Assign this value as the first number, so a = 16.
Take away a from 100, which gives 84.

Generate a random number out of 84. (Let's say it generates 21).
Assign this value as the second number, so b = 21.
Take away b from 84, which gives 63.

Generate a random number out of 63. (Let's say it generates 40).
Assign this value as the third number, so c = 40.
Take away c from 63, which gives 23.

Assign the remainder as the fourth number, so d = 23.

However, for some reason I have a funny feeling about this method. Am I truly generating four random numbers that sum to 100 here? Would this be equivalent to me generating four random numbers out of 100 over and over again, and only accepting when the sum is 100? Or am I creating some sort of bias by picking a random number out of 100, and then a random number out of the remainder, and so on? Thanks.


8
On BEST ANSWER

No, this is not a good approach - half the time, the first element will be $50$ or more, which is way too often. Essentially, the odds that the first element is $100$ should not be the same as the odds that the first element is $10$. There is only one way for $a=100$, but there are loads of ways for $a=10$.
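The size of this bias can be checked with a short Python sketch. It assumes that "a random number out of 100" means a uniform integer from $0$ to $100$ inclusive; the function names are made up for illustration and are not from the question.

```python
import random
from fractions import Fraction
from math import comb

def sequential_sample(total=100, rng=random):
    """The question's method: draw each part uniformly from what is left.

    Assumption: each draw is a uniform integer from 0 to the remainder, inclusive.
    """
    a = rng.randint(0, total)
    b = rng.randint(0, total - a)
    c = rng.randint(0, total - a - b)
    return a, b, c, total - a - b - c

def prob_first_at_least(k, total=100, parts=4):
    """Exact P(a >= k) when every sum a+b+c+d = total is equally likely."""
    # Removing k from a leaves a sum of total - k split into `parts` parts.
    favourable = comb(total - k + parts - 1, parts - 1)
    return Fraction(favourable, comb(total + parts - 1, parts - 1))

# Estimate P(a >= 50) for the sequential method by simulation.
trials = 20_000
hits = sum(sequential_sample()[0] >= 50 for _ in range(trials))
print("sequential method:", hits / trials)
print("uniform over all sums:", float(prob_first_at_least(50)))
```

Under that assumption the sequential method gives $a \geq 50$ with probability $51/101$, roughly one half, whereas a uniform choice over all sums gives roughly $0.13$.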

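The number of such sums $100=a+b+c+d$ with $a,b,c,d\geq 0$ integers is $\binom{100+3}{3} = 176851 = 17\cdot 101\cdot 103$, so each sum should occur with probability $\frac{1}{176851}$. If your algorithm never chooses uniformly from $1$ to some multiple of $103$, it cannot produce that probability in a fixed number of steps, so you can't get an even probability.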

An ideal approach: pick a number $x_1$ from $1$ to $103$. Then pick a different number $x_2\neq x_1$ from $1$ to $103$, and then a third number $x_3\neq x_1,x_2$ from $1$ to $103$.

Then sort these values, so that $x_1<x_2<x_3$. Then set $$a=x_1-1, b=x_2-x_1-1, c=x_3-x_2-1, d=103-x_3.$$
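A minimal Python sketch of this construction follows. The function name and the generalisation to other totals and part counts are assumptions added for illustration; with the defaults it reduces to the formulas above.

```python
import random

def uniform_composition(total=100, parts=4, rng=random):
    """Pick parts-1 distinct positions in 1..total+parts-1 and read off the gaps."""
    # rng.sample draws without replacement, so the positions are distinct.
    bars = sorted(rng.sample(range(1, total + parts), parts - 1))
    # Sentinels at both ends turn every part into "gap between neighbours, minus 1".
    edges = [0] + bars + [total + parts]
    return [hi - lo - 1 for lo, hi in zip(edges, edges[1:])]

a, b, c, d = uniform_composition()
print(a, b, c, d, "sum =", a + b + c + d)
```

With `total=100` and `parts=4` the sentinels give $a=x_1-1$ and $d=104-x_3-1=103-x_3$, matching the expressions above.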

1
On

There may be a need for a slightly more precise specification of what kind of sample you want. To begin with, though, you may feel less uneasy if you sample by picking three numbers at random in $[0,100] \cap \mathbb{Z}$. Call them $a,b,c$, and suppose you have ordered them so that $0 \le a \le b \le c \le 100$.

Now set
$$\begin{aligned} x_1 &= a \\ x_2 &= b-a \\ x_3 &= c-b \\ x_4 &= 100-c \end{aligned}$$
so that
$$ \sum_{k=1}^4 x_k = (100-c)+(c-b)+(b-a)+a = 100. $$
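As a rough Python sketch of this recipe (the function name is hypothetical, and the three draws are assumed to be independent, so repeated values are allowed):

```python
import random

def cut_point_sample(total=100, rng=random):
    """Sort three independent uniform integers in [0, total] and take the gaps."""
    a, b, c = sorted(rng.randint(0, total) for _ in range(3))
    return [a, b - a, c - b, total - c]

print(cut_point_sample())
```

Because the three draws may coincide, results with a zero in a middle position arise from fewer orderings of the draws, so this is not exactly the same as choosing uniformly among all sums.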

1
On

Are computational constraints really an issue? Do you intend to scale this up to higher numbers? Does this method need to be achieved using physical dice, a dice rolling program, Excel, or a programming language?

As Thomas Andrews points out, using that method will bias towards something like 50/25/12/12 compared to 25/25/25/25.

Why not just roll four dice numbered from 0 to 100 and check whether they sum to 100? If they do, keep the result; otherwise roll again. The Java code below was tested with 4 numbers up to 1,000,000 and returned results within about 3 seconds. For larger numbers, you will need to be smarter.

public class RollerMain {

    public static void main(String[] args) {
        // Rejection sampling: redraw all four numbers until they sum to 100.
        while (true) {
            // (int) (Math.random() * 101) is a uniform integer from 0 to 100.
            int firstNumber = (int) (Math.random() * 101);
            int secondNumber = (int) (Math.random() * 101);
            int thirdNumber = (int) (Math.random() * 101);
            int fourthNumber = (int) (Math.random() * 101);
            if (firstNumber + secondNumber + thirdNumber + fourthNumber == 100) {
                System.out.println("first number = " + firstNumber);
                System.out.println("second number = " + secondNumber);
                System.out.println("third number = " + thirdNumber);
                System.out.println("fourth number = " + fourthNumber);
                break;
            }
        }
    }
}
3
On

Generate four random numbers between $0$ and $1$.

Add these four numbers; then divide each of the four numbers by the sum, multiply by $100$, and round to the nearest integer.

Check that the four integers add to $100$ (they will, two thirds of the time). If they don't (rounding errors), try again...
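These three steps could be sketched in Python as follows. The function name is illustrative, and the draws are assumed to be independent and uniform on $[0,1)$, which the answer does not state explicitly.

```python
import random

def normalise_and_round(total=100, parts=4, rng=random):
    """Scale uniform draws to sum to `total`, round, and retry on a mismatch."""
    while True:
        raw = [rng.random() for _ in range(parts)]
        s = sum(raw)
        if s == 0:
            continue  # practically impossible, but avoids dividing by zero
        rounded = [round(x * total / s) for x in raw]
        # Rounding can leave the sum slightly off; if so, draw again.
        if sum(rounded) == total:
            return rounded

print(normalise_and_round())
```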

0
On

Your question mentions an inefficient algorithm: generate four independent, uniformly distributed integers from 0 to 100, and repeat until their sum is 100. I'll assume you are satisfied with the distribution generated by that algorithm, but not with its performance.

Before looking into how to produce the distribution more efficiently, one first has to understand what the distribution looks like.

By construction it is easy to see that $a$, $b$, $c$, and $d$ are identically distributed. It is also easy to see that they are not independent, because their sum is constant. What we already know about their common distribution is that it has minimum value 0, maximum value 100, and average value 25. The average follows because the four numbers always sum to 100 and all have the same mean.

This rules out a uniform distribution of the individual numbers (and in fact it rules out every distribution that is symmetrical about the midpoint 50). This means your more efficient algorithm, which generates $a$ uniformly, will produce a different distribution.

Towards an efficient algorithm

If we define $X = a+b$ and ask what the distribution of $X$ looks like, we will find something interesting. The distribution clearly doesn't depend on which pair of the four numbers we summed. So all six possible pairs are identically distributed, but not independent. This distribution has minimum 0, maximum 100, and average 50. And the distribution has to be symmetrical because $X$ and $100-X$ are identically distributed.

It is not immediately obvious whether the distribution of $X$ is uniform across the integers from 0 to 100. However, if the distribution of $X$ can be generated efficiently, then the distribution of all four numbers can be generated efficiently as follows:

  • Generate $X$
  • Choose $a$ uniformly random in the range $0$ to $X$
  • Let $b := X-a$
  • Choose $c$ uniformly random in the range $0$ to $100-X$
  • Let $d := 100-X-c$

The distribution of X

The original algorithm would produce $X$ as the sum of two uniformly random numbers in the range $0$ to $100$, but discard any results where the overall sum was different from $100$.

A different algorithm could generate $X = a+b$ and $Y = c+d$ independently according to this distribution and discard the result if $X+Y \neq 100$. This is useful because the generation of $X$ and $Y$ can be simplified.

If $X$ is larger than $100$, it can be discarded immediately, and likewise for $Y$. It is easy to work out the distribution that remains before the sum of $X$ and $Y$ is checked. The sum of two independent uniform numbers from $0$ to $100$ takes a value $x \in [0;100]$ with probability $\frac{1+x}{101^2} = \frac{1+x}{10201}$, but when we discard values larger than $100$, the probability becomes $\frac{1+x}{5151}$, because $\sum_{x=0}^{100}(1+x) = 5151$.

The probability of immediately generating $X=x$ and $Y=100-x$ can then be computed as $\frac{1+x}{5151} \cdot \frac{1+(100-x)}{5151} = \frac{(1+x)(101-x)}{5151^2}$. The conditional probability $P(X=x \mid X+Y=100)$ can then be computed by simply rescaling the denominator so that the probabilities sum to $1$, which gives $\frac{(1+x)(101-x)}{k}$ for a constant $k$.

At this point it is clear that $X$ isn't uniformly distributed. But it also gives us a way to construct $X$ directly.

In order to generate the distribution of $X$ directly, we need a formula for $P(X \leq x)$. This formula will be:

$$P(X \leq x) = \frac{\sum_{i=0}^x (1+i)(101-i)}{k} = \frac{-2x^3 + 297x^2 + 905x + 606}{6k}$$

Because we know that $P(X \leq 100) = 1$, we can deduce that $k=176851$.
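As a check on that value, the closed form can be evaluated at $x=100$; the result also agrees with the count $\binom{103}{3}$ of all sums $a+b+c+d=100$ with non-negative integers:

$$\sum_{i=0}^{100}(1+i)(101-i) = \frac{-2\cdot 100^3 + 297\cdot 100^2 + 905\cdot 100 + 606}{6} = \frac{1061106}{6} = 176851 = \binom{103}{3}.$$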

With this the algorithm becomes:

  • Choose $r$ uniformly random from the integers $[0;176850]$
  • Take the smallest $x$ such that $\sum_{i=0}^x (1+i)(101-i) > r$
  • Choose $a$ uniformly random in the range $0$ to $x$
  • Let $b := x-a$
  • Choose $c$ uniformly random in the range $0$ to $100-x$
  • Let $d := 100-x-c$
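The algorithm can be sketched in Python roughly as follows. The names are illustrative only. The sketch uses a strict inequality when locating $x$, so that each $x$ receives exactly $(1+x)(101-x)$ of the possible values of $r$, and it sets $d$ to the remainder $100-x-c$ so that the four numbers sum to $100$.

```python
import random
from bisect import bisect_right
from itertools import accumulate

TOTAL = 100
# WEIGHTS[x] counts the tuples (a, b, c, d) with a + b = x and c + d = TOTAL - x.
WEIGHTS = [(1 + x) * (TOTAL + 1 - x) for x in range(TOTAL + 1)]
CUMULATIVE = list(accumulate(WEIGHTS))  # running totals; the last entry is k

def sample_via_pair_sum(rng=random):
    """Draw X = a + b from its exact distribution, then split both halves uniformly."""
    r = rng.randrange(CUMULATIVE[-1])   # uniform integer in 0 .. k-1
    x = bisect_right(CUMULATIVE, r)     # smallest x with CUMULATIVE[x] > r
    a = rng.randint(0, x)
    b = x - a
    c = rng.randint(0, TOTAL - x)
    d = TOTAL - x - c
    return a, b, c, d

print(sample_via_pair_sum())
```

The lookup uses binary search on the precomputed running totals, so each sample costs a handful of comparisons and three random draws.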
0
On

The method described will be random, as will any number of schemes. Whether the method is suitable depends on the application. (For example, we could choose $\{x_1,x_2,x_3,x_4\}$ by taking a simple random sample from $\{\{10,20,30,40\},\{25,25,25,25\},\{0,0,0,100\}\}$, but I suspect the properties of such a random sample would not be desirable for most applications).

One method that produces samples with potentially desirable qualities is to sample $\{x_1,..,x_m\}$ from all integer compositions of $n=100$ of length $m=4$ (with zeros allowed).

The total number of such compositions is $\binom{n+m-1}{m-1}$. The marginal distribution of each $x_j$ is

$$P(x_j=i)=\frac{\binom{n+m-i-2}{m-2}}{\binom{n+m-1}{m-1}}$$

which leads to the following iterative sampling scheme:

For $j=1...m$, set $x_j=i\in[0,n-\sum_{k=0}^{j-1}x_k]$ with probability

$$\frac{\binom{n+m-i-j-1-\sum_{k=0}^{j-1}x_k}{m-j-1}}{\binom{n+m-j-\sum_{k=0}^{j-1}x_k}{m-j}}$$

where $x_0\equiv0$.

The distribution of $\{x_1,..,x_m\}$ will be the same as in this answer, but without resorting to rejection sampling.

As an R function:

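rcomp <- function(n, m) {
  if (m == 1L) {
    # Only one part left: it must take whatever remains.
    n
  } else {
    m2 <- m - 2L
    # Weight for x = 0, 1, ..., n is choose(n - x + m - 2, m - 2), the number
    # of ways to split the remainder n - x into the other m - 1 parts.
    x <- sample(0:n, 1, replace = TRUE, prob = choose((n + m2):m2, m2))
    c(x, Recall(n - x, m - 1L))
  }
}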

For $n=100$, $m=4$, compare the distribution of $x_i$ from samples generated from rcomp with the actual distribution.

n <- 100L
m <- 4L
r1 <- replicate(choose(n + m - 1, m - 1), rcomp(n, m))
r2 <- partitions::compositions(n, m)
all(colSums(r1) == n)
#> [1] TRUE
plot(ecdf(unlist(r2)), col = "blue", ylab = "CDF", main = "Distribution of x")
lines(ecdf(unlist(r1)), col = "orange", pch = 20)
legend("topleft", legend = c("Actual", "Sample"), col = c("blue", "orange"), pch = c(1, 20))


A couple of notes: for relatively small $n$, as we have here, it will be faster to sample from partitions::compositions(n, m) than to use rcomp. Also, rcomp will begin to experience numeric difficulties with large $n$. In that case, one approach could be to approximate the distribution with a discretized Dirichlet distribution, which is straightforward to sample. The discretization scheme can be implemented to ensure $\sum x_i=n$.
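For comparison, the same sequential scheme might be sketched in Python, where integers have arbitrary precision and the weights can therefore be kept exact. This is an illustrative port rather than a tuned implementation; the function name is made up, and it builds a list of weights for every part, so the work per part grows linearly with the remaining total.

```python
import random
from math import comb

def random_composition(n, m, rng=random):
    """Sample m non-negative integers summing to n, one part at a time."""
    parts = []
    remaining = n
    for slots_left in range(m, 1, -1):
        # weights[i] = number of ways to split remaining - i into slots_left - 1 parts.
        weights = [comb(remaining - i + slots_left - 2, slots_left - 2)
                   for i in range(remaining + 1)]
        # Exact integer draw: walk the weights until r falls inside one of them.
        r = rng.randrange(sum(weights))
        i = 0
        while r >= weights[i]:
            r -= weights[i]
            i += 1
        parts.append(i)
        remaining -= i
    parts.append(remaining)  # the last part is whatever is left
    return parts

print(random_composition(100, 4))
```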