Combining binomial probabilities of many independent, non-summable variables


TL;DR: How can I calculate a single number to capture the overall "rarity" of falling at percentiles L_k across a large set of K non-summable, independent binomial distributions X_k, avoiding the pitfalls described below?

Hi all, I've searched far and wide for an answer to this question, without success, so I apologize if this question is at all atypical for this forum, or is a bit long.

I am trying to calculate how lucky my character has been in a video game, overall. In general, each item k can be dropped with known probability p_k after defeating some boss. After N_k attempts, X_k items have been acquired. Most X_k are binomially distributed, although some follow more complicated distributions, such as the Poisson binomial.

I have defined my "luck" for a particular item, L_k, to be the chance of having received at most X_k items given N_k, p_k, and given the particular type of distribution. In other words, it's my percentile rank compared to other players in terms of # items received vs. # items expected. For a simple binomial distribution Binom(N_k, p_k), this is just equal to the cumulative binomial probability Pr(x <= X_k).
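For the plain binomial case, the definition above can be sketched with the standard library alone. The function name `binomial_luck` and the example numbers (a 1% drop rate, 100 attempts, 0 drops) are assumptions made up for illustration.

```python
from math import comb


def binomial_luck(x_k: int, n_k: int, p_k: float) -> float:
    """Return Pr(x <= x_k) for x ~ Binom(n_k, p_k)."""
    if not 0 <= x_k <= n_k:
        raise ValueError("x_k must lie between 0 and n_k")
    # Sum the binomial probability mass from 0 up to and including x_k.
    return sum(
        comb(n_k, i) * p_k**i * (1 - p_k) ** (n_k - i)
        for i in range(x_k + 1)
    )


# Hypothetical item: 1% drop chance, 100 attempts, nothing received.
luck = binomial_luck(0, 100, 0.01)
print(round(luck, 3))  # about 0.366, i.e. 0.99 ** 100
```

Items that are not plain binomial (the Poisson binomial ones, for example) would need their own CDF in place of this sum.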

The trouble I am having is how to combine all L_k into an overall luck score, L = F(L_k). Previous answers I've seen on the web have stated that this is also a Poisson binomial distribution, but this wrongly assumes I only care about the total number of items received after combining the distributions of all X_k. Secondly, not all X_k are binomially distributed, though we can assume this is true for now if it complicates the answer.

I initially thought that all L_k are uniformly distributed, because I've defined luck to be my percentile compared to all other hypothetical players. It's equally likely to be at any specific percentile: for example, floor(L_k * 100) = 2 is just as likely as floor(L_k * 100) = 99 because each luck "bucket" (of any fixed size) contains the same number of players.

So, I thought I could combine all L_k according to an Irwin-Hall distribution: "The Irwin–Hall distribution is the continuous probability distribution for the sum of n independent and identically distributed U(0, 1) random variables".

But this doesn't make any sense numerically, after checking the results. Suppose you are in the 0.0000001th percentile for the number of items received for k=1, and in the 70th percentile for another item, k=2. According to the Irwin-Hall distribution, the combined luck / probability would be something like 35%, but this must be incorrect given that you were so incredibly unlucky for one item and only slightly above average for the other. Maybe the idea of summing these variables is just wrong, because they don't represent the same thing.
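The two-item example can be checked with a short sketch of the Irwin-Hall CDF. It assumes the combined score is the CDF evaluated at the plain sum of the L_k, which is only one possible reading of "combining according to Irwin-Hall"; `irwin_hall_cdf` is a name chosen for illustration.

```python
from math import comb, factorial, floor


def irwin_hall_cdf(s: float, n: int) -> float:
    """CDF at s of the sum of n independent U(0, 1) variables."""
    if s <= 0:
        return 0.0
    if s >= n:
        return 1.0
    # Standard closed form: (1/n!) * sum_{k=0}^{floor(s)} (-1)^k C(n,k) (s-k)^n
    total = sum(
        (-1) ** k * comb(n, k) * (s - k) ** n
        for k in range(floor(s) + 1)
    )
    return total / factorial(n)


# 0.0000001th percentile is 1e-9 as a probability; 70th percentile is 0.70.
lucks = [1e-9, 0.70]
combined = irwin_hall_cdf(sum(lucks), len(lucks))
print(round(combined, 3))  # 0.245
```

Under this reading the result is about 0.245. The exact figure depends on how the sum is mapped back to a probability, but either way it is an unremarkable value that hides the one-in-a-billion item entirely.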

I have a rough sense that lucks are (very roughly) multiplicative (e.g., being in the 1st percentile for 2 different items is surely rarer than 1 in 100, assuming they're the only items, because you're doubly unlucky), so I discounted all types of averaging methods (harmonic, geometric, arithmetic mean, etc.). I did have decent success combining the "probabilities" L_k in logit space by summing log-odds, then converting back, but then I had another issue:
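The logit-space combination described above can be sketched as follows. The helper names are illustrative, and the sketch assumes every L_k lies strictly between 0 and 1 so that the log-odds are finite.

```python
from math import exp, log


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return log(p / (1 - p))


def inv_logit(z: float) -> float:
    """Map log-odds back to a probability without overflowing exp()."""
    if z >= 0:
        return 1 / (1 + exp(-z))
    e = exp(z)
    return e / (1 + e)


def combine_logit(lucks: list[float]) -> float:
    """Sum the log-odds of each luck value, then convert back."""
    return inv_logit(sum(logit(p) for p in lucks))


# Two items, both at the 1st percentile.
print(combine_logit([0.01, 0.01]))  # about 0.000102, i.e. 1/9802
```

Summing log-odds multiplies the odds, so two 1-in-100 results give odds of (1/99)^2 and a combined value of 1/9802, which matches the "doubly unlucky" intuition.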

Given many possible types of items (K > 1000), it is expected that your average luck will deviate a certain amount from 0.5 (or, at least, as the average approaches 0.5 across all items, the sum of logits will diverge from 0). So, summing all of these logits makes it extremely likely that the overall luck L will be nearly 0 or 1, which ruins its interpretation as a percentile that ranks you against other hypothetical players in terms of receiving many rare items.
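This saturation can be seen in a small simulation. The sketch assumes each L_k is an independent U(0, 1) draw (the idealised case from earlier in the post) and uses made-up sizes: 1000 items and 200 hypothetical players.

```python
import random
from math import exp, log


def combine_logit(lucks: list[float]) -> float:
    """Sum log-odds and convert back, using an overflow-safe sigmoid."""
    z = sum(log(p / (1 - p)) for p in lucks)
    if z >= 0:
        return 1 / (1 + exp(-z))
    e = exp(z)
    return e / (1 + e)


def extreme_fraction(num_items: int, num_players: int, seed: int = 0) -> float:
    """Fraction of simulated players whose combined luck is < 0.01 or > 0.99."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(num_players):
        # Clamp away from 0 and 1 so the log-odds stay finite.
        lucks = [
            min(max(rng.random(), 1e-12), 1 - 1e-12)
            for _ in range(num_items)
        ]
        combined = combine_logit(lucks)
        if combined < 0.01 or combined > 0.99:
            extreme += 1
    return extreme / num_players


print(extreme_fraction(num_items=1000, num_players=200))
```

The logit of a uniform variable has variance pi^2 / 3, so the sum over 1000 items has a standard deviation of roughly 57, while the combined value only stays between 0.01 and 0.99 when the sum is within about 4.6 of zero. Most simulated players therefore land at an extreme even though every one of them is perfectly ordinary.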

There is also another fundamental flaw: defining L_k = Pr(x <= X_k) doesn't make any sense when both p_k and N_k are very low. Surely confidence intervals of some sort come into play here, since defeating a boss 1 time and not receiving the item means very little.
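A tiny example makes the small-sample problem concrete. The sketch computes both Pr(x < X_k) and Pr(x <= X_k) for a hypothetical item with a 1% drop chance after a single attempt; `luck_bounds` is an illustrative name, and reporting the pair of values is just one way to expose the issue, not a proposed fix.

```python
from math import comb


def binom_pmf(i: int, n: int, p: float) -> float:
    """Pr(x = i) for x ~ Binom(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def luck_bounds(x_k: int, n_k: int, p_k: float) -> tuple[float, float]:
    """Return (Pr(x < x_k), Pr(x <= x_k)) for x ~ Binom(n_k, p_k)."""
    lower = sum((binom_pmf(i, n_k, p_k) for i in range(x_k)), 0.0)
    upper = lower + binom_pmf(x_k, n_k, p_k)
    return lower, upper


# One attempt at a 1% item, no drop.
lower, upper = luck_bounds(0, 1, 0.01)
print(lower, upper)  # lower is 0.0, upper is about 0.99
```

Here L_k = Pr(x <= 0) is about 0.99, so the definition reports the 99th percentile for the most common outcome possible. About 99% of hypothetical players are tied at exactly this result, so the true rank could sit anywhere between 0 and 0.99, and L_k is far from uniform in this regime.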

So, my question is: How can I calculate a single number to capture the overall "rarity" of falling at percentiles L_k across a large set of K non-summable, independent, (mostly) binomial distributions X_k, avoiding the above pitfalls?

And as a follow-up, I may want to assign an importance weight I_k to each item, indicating either difficulty / time-investment of defeating the boss or the monetary value of the item, though again, it can be ignored if it complicates the question too much.