Probabilities seem to be growing exponentially


We have instituted random drug testing at our company. I was charged with writing the code to generate the weekly random list of employees. We've gotten some complaints because of people getting picked more frequently than they expected and our lawyer wants some evidence that these events fall within the bell curve of randomness.

I'm very confident in our code. I have now written several Monte Carlo simulations that back up the results we've had. That said, all my Monte Carlo simulations (each written from scratch, completely independently) also show a phenomenon that I can't explain and I'm hoping others can.

Here are the parameters I'm using: 4550 employees, of which 91 (2%) are picked each week at random.

The phenomenon we're encountering is this:

Over the first 20 weeks, we expect (according to the Monte Carlo simulations) roughly 32 people to be picked 3+ times (and around 2.7 people to be picked 4+ times, but let's just stick with the people picked 3+ times). And we've had the program going for about 20 weeks and the numbers seem to agree so far.

Over the first 40 weeks, the number of people picked 3+ times shoots up to 207 (more than 6 times as many in twice the time).

Over the first 52 weeks, the number shoots up again to 390 (30% more time than 40 weeks, but nearly 90% more people picked 3+ times).
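For readers who want to reproduce these figures, here is a minimal sketch of this kind of simulation. It is not the code used above; the function names, the seeding scheme, and the number of trials are all assumptions made for illustration. It draws 91 distinct employees out of 4550 each week and averages, over many runs, how many employees were picked at least `k` times.

```python
import random


def simulate(num_employees=4550, picks_per_week=91, weeks=52, seed=0):
    """Return a list where entry e is the number of weeks employee e was picked."""
    rng = random.Random(seed)  # fixed seed per run so results are repeatable
    counts = [0] * num_employees
    for _ in range(weeks):
        # sample() draws without replacement: nobody is picked twice in one week
        for employee in rng.sample(range(num_employees), picks_per_week):
            counts[employee] += 1
    return counts


def picked_at_least(counts, k):
    """Number of employees picked k or more times."""
    return sum(1 for c in counts if c >= k)


def average_picked_at_least(k, weeks, trials=100):
    """Average of picked_at_least over several independent simulated runs."""
    total = 0
    for trial in range(trials):
        total += picked_at_least(simulate(weeks=weeks, seed=trial), k)
    return total / trials


if __name__ == "__main__":
    for weeks in (20, 40, 52):
        print(weeks, average_picked_at_least(3, weeks))
```

With enough trials the averages should settle close to the 32, 207 and 390 quoted above, although any single run will scatter around those values.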

Maybe I've written all my Monte Carlo simulations wrong, but I'm pretty sure I haven't. I've looked at all this a bunch of different ways and I'm convinced this phenomenon is real, but I need to be able to explain it to the VP of HR, and I'm not sure why the number of people picked 3+ times rises so fast from, say, 40 to 52 weeks. (The same thing happens with all of the counts: the number of people picked 4+ times, the number of people picked 5+ times, etc.)

I do understand that, say, in the first 4 weeks you can't possibly have anyone picked 5 times, so the first week where that would be possible is week 5. So after 10 weeks, you have 5 times as many opportunities for someone to be picked 5 times as you do in 5 weeks (a 500% increase in odds for a 100% increase in time).

But I'm not sure that explains the 40 week to 52 week changes. Or does it?

I've also ruled out any issues with the random number generator (I get roughly the same results using the basic one as I do using the random number generator from the cryptography library).

Thanks to anyone who can explain this in a way that I can take back to HR and our legal guys.

Update

To expound a bit on the process, here's an example: I have a database table that I've created called DrugTest. It has 2 columns: TestRun and Employee. Both columns are integers.

So for 52 weeks, I have TestRun values of 1 to 52 and then I have 91 random employee numbers (numbers between 0 and 4549) for each TestRun value. No employee can be picked twice in the same week (the primary key is (TestRun, Employee), ensuring unique employee numbers for each TestRun value).

For a sample run, I loaded up 52 weeks of data. Then I execute the following query:

select employee, count(*) as cnt
from DrugTest
where TestRun <= 52
group by employee
having count(*) = 3
order by 2

The above query returns 313 results

select employee, count(*) as cnt
from DrugTest
where TestRun <= 40
group by employee
having count(*) = 3
order by 2

The above query returns 178 results

select employee, count(*) as cnt
from DrugTest
where TestRun <= 20
group by employee
having count(*) = 3
order by 2

The above query returns 34 results
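Note that these queries use `having count(*) = 3`, so they count employees picked exactly 3 times rather than 3 or more. As a rough cross-check, the sketch below computes the expected number of employees picked exactly `k` times from the binomial distribution, using the fact that each employee has a 91/4550 = 2% chance of being picked in any given week. The function name and default arguments are illustrative assumptions, not part of the production code.

```python
from math import comb


def expected_exactly(k, weeks, num_employees=4550, p=91 / 4550):
    """Expected number of employees picked exactly k times in `weeks` weeks."""
    # Binomial probability that one employee is picked exactly k times
    prob = comb(weeks, k) * p**k * (1 - p) ** (weeks - k)
    # Linearity of expectation: sum the same probability over all employees
    return num_employees * prob


if __name__ == "__main__":
    for weeks in (20, 40, 52):
        print(weeks, round(expected_exactly(3, weeks), 1))
```

The expected values come out near 29, 170 and 299 for 20, 40 and 52 weeks, which is in line with the 34, 178 and 313 rows returned for this one sample run.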


There are 2 best solutions below

2
On BEST ANSWER

You wrote: "So after 10 weeks, you have 5 times as many opportunities for someone to be picked 5 times as you do in 5 weeks (a 500% increase in odds for a 100% increase in time)."

Actually, the difference in odds is much higher. After 5 weeks, there is only one way to be picked 5 times: on weeks 1, 2, 3, 4, 5. After 10 weeks, you can be picked on weeks 1, 2, 3, 4, 5; or on weeks 1, 2, 3, 4, 6; or on weeks 1, 2, 3, 4, 7; or on weeks 1, 2, 3, 4, 8; or on weeks 1, 2, 3, 4, 9; or on weeks 1, 2, 3, 4, 10; or on weeks 1, 3, 4, 5, 6; and so on. There are $\binom{10}{5}=\frac{10!}{5!\cdot 5!}=252$ ways to select 5 objects out of 10, and the odds increase accordingly. (They do not increase by a factor of exactly 252, but it's a decent approximation.)

Similarly, the number of ways to choose $k$ weeks out of $n$ is $\binom{n}{k} = \frac{n!}{(n-k)!k!} = \frac{n(n-1)\dots (n-k+1)}{k!}$. When $n$ is much greater than $k$, this can be approximated as $\frac{n^k}{k!}$.

So, when the number of weeks passed increases by a factor of $x$, we can estimate that the number of people picked $k$ times will increase by a factor of $x^k$ (so the increase is actually polynomial, not exponential). In particular, if $x=1.3$ and $k=3$, we get an estimated factor of $1.3^3=2.197$. The real increase in your case is $390/207\approx 1.884$ times; as I said, my approximation is rather crude and does not take some additional factors into account. Still, it should explain why the growth is not linear.
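To make the counting argument concrete, here is a small sketch that compares the exact ratio of binomial coefficients with the crude $x^k$ estimate. The last function is an assumption about what one of the "additional factors" might be: for the probability of being picked exactly $k$ times, the factor $(1-p)^{n-k}$ shrinks as $n$ grows and damps the growth. All function names are illustrative.

```python
from math import comb


def growth_exact(n1, n2, k):
    """Ratio of the number of ways to choose k weeks out of n2 versus n1."""
    return comb(n2, k) / comb(n1, k)


def growth_approx(n1, n2, k):
    """Crude estimate x**k, where x = n2 / n1 (valid when n is much larger than k)."""
    return (n2 / n1) ** k


def growth_exactly_k(n1, n2, k, p=0.02):
    """Growth of P(picked exactly k times), including the (1 - p)**(n - k) damping."""
    return growth_exact(n1, n2, k) * (1 - p) ** (n2 - n1)


if __name__ == "__main__":
    print(comb(10, 5))                  # ways to choose 5 weeks out of 10
    print(growth_exact(40, 52, 3))      # exact ratio of binomial coefficients
    print(growth_approx(40, 52, 3))     # the 1.3**3 estimate
    print(growth_exactly_k(40, 52, 3))  # with the damping factor included
```

The damped figure applies to "exactly 3 times" rather than "3 or more times", so it is not expected to reproduce 1.884 exactly; it only shows the direction of the correction.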

8
On

Let $n$ be the number of weeks you've sampled. For any individual, the probability that they are chosen in any given week is $p=.02$. Then the probability that, after $n$ weeks, a given individual has been chosen at least 3 times is:

\begin{align*} \mathbb{P}(\text{chosen } \geq 3 \text{ times} ) &= 1 - \mathbb{P}(\text{chosen}\leq 2 \text{ times}) \\ &= 1 - \left(\sum_{i=0}^{2} \binom{n}{i}p^{i}(1-p)^{n-i}\right) \\ &= 1 - (1-p)^{n} - np(1-p)^{n-1} - \frac{n(n-1)p^2}{2}(1-p)^{n-2} \end{align*}
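The closed form above is easy to evaluate directly. The following sketch (the function names are illustrative, and the 4550 headcount is taken from the question) computes the probability and the corresponding expected number of employees for a few values of $n$.

```python
def prob_at_least_3(n, p=0.02):
    """P(chosen at least 3 times in n weeks), from the closed form above."""
    q = 1 - p
    return (
        1
        - q**n                                    # chosen 0 times
        - n * p * q ** (n - 1)                    # chosen exactly once
        - n * (n - 1) / 2 * p**2 * q ** (n - 2)   # chosen exactly twice
    )


def expected_at_least_3(n, num_employees=4550, p=0.02):
    """Expected number of employees chosen at least 3 times in n weeks."""
    return num_employees * prob_at_least_3(n, p)


if __name__ == "__main__":
    for n in (10, 15, 20, 25, 30, 35, 40, 52):
        print(n, round(100 * prob_at_least_3(n), 2), round(expected_at_least_3(n), 1))
```

Each printed row is a triple in the form (weeks, percent chance, expected count).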

Just picking a few values of $n$ we have that, for $n=10$, any individual has a .08% chance of being chosen at least 3 times. The expected number of employees who would be tested at least 3 times would be 3.9. For reference, call this triple $(10, .08, 3.9)$. The next few values are:

$(10, .08, 3.9)$, $(15, .3, 13.8)$, $(20, .7, 32.2)$, $(25, 1.32, 60.2)$, $(30, 2.17, 98.8)$, $(35, 3.25, 148)$, $(40, 4.57, 207.8)$.

It looks like your values are a little high, but what I've put here is just the expected value. It is very unlikely that you will get exactly what's expected, but it is also unlikely that you will deviate from it by much. The question is: is your deviation unexpectedly large? I'm not a statistician, so I cannot answer that well enough to satisfy. I would, though, check your code and make sure you're not doing something bad when seeding the random number generator you are using. I would also make certain that you are not using your own "home-made" random number generator. Also, random number generators do not generate truly random numbers, let alone "uniform" random numbers. It may be useful to implement something that makes sure that people who've been chosen in the last week or two have a lower probability of being chosen again.
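One rough way to judge whether a deviation is unexpectedly large is to compare it with the standard deviation of the count. The sketch below treats employees as independent, which is only an approximation (exactly 91 people are picked each week, so the counts are slightly negatively correlated), and the function names are hypothetical. Under that assumption the number of employees picked at least $k$ times is roughly binomial, and a z-score within about $\pm 2$ would not be surprising.

```python
from math import comb, sqrt


def prob_at_least(k, n, p=0.02):
    """P(a given employee is picked at least k times in n weeks)."""
    return 1 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def approx_mean_and_sd(k, n, num_employees=4550, p=0.02):
    """Mean and approximate standard deviation of the number of such employees."""
    q = prob_at_least(k, n, p)
    mean = num_employees * q
    # Binomial variance; assumes employees are independent (an approximation)
    sd = sqrt(num_employees * q * (1 - q))
    return mean, sd


def z_score(observed, k, n, num_employees=4550, p=0.02):
    """How many standard deviations the observed count is from the mean."""
    mean, sd = approx_mean_and_sd(k, n, num_employees, p)
    return (observed - mean) / sd


if __name__ == "__main__":
    print(approx_mean_and_sd(3, 20))
    print(z_score(32, 3, 20))  # 32 is the 20-week figure quoted in the question
```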