Determining the probability that a group within a population is of any given size based on limited sampling


I have a population of $N (\approx8000)$ entities, each numbered from $0$ to $N-1$ (sorry, I'm a programmer). This population is split into two groups, $A$ and $B$, based on a simple boolean characteristic of each entity, with entities that do not have this characteristic belonging to group $A$, and those that do belonging to group $B$.

Determining this characteristic for any entity can only be done manually, and takes anywhere from a few seconds to a minute. However, any entity can be checked at any time, given just its number.

At certain known, rare points in time, any entity may gain or lose this characteristic with a fairly low probability. So at these points a slight change in the sizes of the groups is expected; $N$, however, stays constant.

From some earlier measurements (from before a few such points in time where changes might have occurred), I know that $|A| \approx 7750$ and $|B| \approx 250$. And in the future, after any further such points in time, I'd like to keep these estimates up to date by doing additional sampling.

So, what I'd like to be able to do is take a short sampling session: check some $M (\approx 100)$ entities, chosen in any way, and then, seeing how many of the sampled entities have the interesting characteristic (with a random sample of this size, I expect about $3$ such entities), determine the probability that $|B| = S_B$ holds for any given $0 \le S_B < N$. Let $p(S_B)$ be the result of a sampling session: a function giving this probability with respect to $S_B$.

Of course, in this example (if I really found exactly $3$ of the $100$ sampled entities having the interesting characteristic), $p(S_B)$ should be $0$ for any $S_B < 3$ and also for any $S_B > N - 97$ - hopefully with a peak somewhere around the true $|B|$.

And if all entities were to be sampled (which is the infeasible $M = N$ case), $p(S_B)$ should be $0$ everywhere, except for $p(|B|) = 1$.

EDIT:

So, I was able to use Maximum Likelihood Estimation, as suggested by @lulu. This is what I got:

[plot of the resulting likelihood curve]

As generated by this Python 3 program: https://gist.github.com/torokati44/789c27a49ba06ba50d5ff15e85409921
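For reference, the heart of that computation can be sketched like this (a minimal re-implementation under the with-replacement assumption, i.e. a binomial likelihood normalized over all candidate group sizes - not the gist itself):

```python
from math import comb

N = 8000   # population size
M = 100    # sample size
k = 3      # sampled entities found to have the characteristic

def binomial_likelihood(s_b):
    """Likelihood of observing k hits in M independent draws
    when a fraction s_b / N of the population is in group B."""
    p = s_b / N
    return comb(M, k) * p ** k * (1 - p) ** (M - k)

# Normalizing over all candidate group sizes yields p(S_B).
raw = [binomial_likelihood(s) for s in range(N + 1)]
total = sum(raw)
p = [v / total for v in raw]

print(max(range(N + 1), key=raw.__getitem__))  # peaks at (k/M) * N = 240
```

The curve peaks at $\frac{k}{M} N = 240$, close to the earlier estimate of $|B| \approx 250$.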

This is very close to what I expected to get! My problem is that I believe this applies to sampling with replacement (put-backs) - whereas I am able to do the sampling without any repetitions, simply by selecting the checked entities differently.

I would expect that after checking all $8000$ entities, the function would be $0$ everywhere except at the actual $|B|$, where it would be exactly $1$.

Is there a different formula that accounts for sampling without replacement?
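(For what it's worth, the textbook distribution for sampling without replacement appears to be the hypergeometric one: the probability of drawing exactly $k$ members of $B$ in $M$ draws is $\binom{S_B}{k}\binom{N-S_B}{M-k}/\binom{N}{M}$. A quick sketch of what the adjusted likelihood might look like:)

```python
from math import comb

N, M, k = 8000, 100, 3

def hypergeom_likelihood(s_b):
    """Probability of finding exactly k members of B in M draws
    *without* replacement, if |B| = s_b."""
    if k > s_b or M - k > N - s_b:
        return 0.0  # impossible: not enough of one group left to draw from
    return comb(s_b, k) * comb(N - s_b, M - k) / comb(N, M)

# It is exactly zero outside the feasible range, as hoped:
assert hypergeom_likelihood(2) == 0.0
assert hypergeom_likelihood(N - 96) == 0.0
```

In the exhaustive $M = N$ case, $\binom{S_B}{k}\binom{N-S_B}{N-k}/\binom{N}{N}$ is nonzero only when $S_B = k$, which would give exactly the single spike described above.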


Disclaimer: This is not homework, this has nothing to do with my job, and the actual outcome isn't even that important, I just got curious as to whether these probabilities can be determined somehow, given the practical constraints at hand.


There is 1 answer below.


Being the lowly programmer that I am - and one with access to a fairly fast computer, no less - I did the not-at-all-surprising thing: went all BF-N & MC $^{[1]}$ on the problem.

So, I set up an experiment in which, for each $0 \le n \le N$, I randomly selected $n$ of the $N$ entities to be "bad", then checked some fraction (e.g. $M = 4000$) of the entities, selected at random. I repeated this an arbitrary but high (though not too high) $I = 10000$ number of times for each $n$.

If the number of "bad" entities found happened to equal the pre-determined $B$ (e.g. $125$), I incremented a counter corresponding to $n$. Finally, I normalized the gathered counters by dividing each by the total number of cases where the number of bad entities found equaled $B$ - essentially computing a conditional probability.
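The same procedure can be sketched in Python with heavily scaled-down parameters so it finishes quickly (the linked Rust program uses the real ones); here the sampling is done without replacement:

```python
import random

# Toy-sized parameters (the real run used N = 8000, M = 4000, I = 10000).
N, M, I = 100, 50, 500
B_FOUND = 3            # the pre-determined number of "bad" entities found
rng = random.Random(42)

counts = [0] * (N + 1)
for n in range(N + 1):
    for _ in range(I):
        bad = set(rng.sample(range(N), n))    # mark n random entities "bad"
        checked = rng.sample(range(N), M)     # check M entities, no repeats
        if sum(e in bad for e in checked) == B_FOUND:
            counts[n] += 1

# Normalize into the conditional probability P(|B| = n | found == B_FOUND).
total = sum(counts)
p = [c / total for c in counts]
```

Note that `counts[n]` is guaranteed to be zero whenever $n < B_{FOUND}$ or $n + M - N > B_{FOUND}$, since the overlap between the "bad" set and the checked set is bounded on both sides.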

This is the Rust program - for more performance and trivial parallelism - I used: https://gist.github.com/torokati44/a84f6c94c706637e7f794e0e5f8434cb

To verify my method, I first selected the checked entities with replacement (that is, allowing the same entity to be checked any number of times).

Here are the results:

[plot of the Monte Carlo results]

While the Monte Carlo results are noisy and not all that clear, the overall picture at least matches the analytical solution in the edited question. I think I could make it cleaner simply by increasing $I$ and waiting longer, by applying some smoothing to the data points (e.g. convolution with a Hanning window), or by fitting a polynomial or a Gaussian to them.

Then I compared two simulations, one where replacements are allowed, and one where each checked entity is guaranteed to be looked at only once:

[plot comparing the with-replacement and without-replacement simulations]

There is hardly any difference in the $M < 1000$ cases, but at $M = 4000$, the without-replacement curve is considerably narrower and taller. Of course, $100 \ll 4000$, so this isn't all that useful in practice.

Anyway, at least with $M = 8000$, I get the Dirac delta I desired when disallowing repetitions:

[plot showing the sharp spike at the true $|B|$ for $M = N = 8000$]

While this method may not be mathematically elegant or particularly pleasing, I think it's good enough for what I needed it for - that is, nothing serious.

$[1]:$ Brute-Force Numerical & Monte Carlo