A statistics question: how to tell how much more likely something is to occur if it occurred in the past.

I have a pool of about 10 000 uniquely labeled items, some of which are periodically picked.

After 150 draws, we find that 100 different items have been picked. Naturally, some items have been picked up to 4 times.

So I have a breakdown that looks a bit like this:

Picked 0 times: 9900 items

Picked 1 time: 80 items

Picked 2 times: 10 items

Picked 3 times: 6 items

Picked 4 times: 4 items

It seems obvious that if an item has been picked in the past, it is more likely to be picked again in the future. I’d like to quantify this, that is, to be able to say that items that have already been selected are X% more likely to be picked again. How do I compute such a figure?

Best answer

Each item $i\in\mathcal I$ has a probability $P_i$ of being chosen each time, and here I will assume the trials are independent. If we somehow know that there are exactly $10\,000$ items, each of which has positive probability of being chosen each time, then, for reasons that might require a long argument, I will assume the $10\,000$-tuple $(P_i)_{i\in\mathcal I}$ is uniformly distributed in the space of all $10\,000$-tuples $(p_i)_{i\in\mathcal I}$ for which $\sum_{i\in\mathcal I} p_i=1$ and for all $i\in\mathcal I,$ $p_i>0.$

For $i\in\mathcal I,$ let $x_i$ be the number of times item $i$ has been chosen so far in $150$ trials. Then the likelihood function is

$$ L(p_i : i\in \mathcal I) = \prod_{i\in\mathcal I} p_i^{x_i}, $$ and we know that $\sum_{i\in\mathcal I} x_i = 150,$ so most of the $x$s are $0.$

The prior probability distribution may be written as $$ \text{constant} \times \prod_{i\,\in\,\mathcal I\,\smallsetminus \, \{j\}} dp_i, $$ where $j$ is some distinguished element, since there are $10\,000-1$ degrees of freedom here. It doesn't matter which element is chosen for that role.

The posterior probability distribution is a constant times the likelihood function times the prior distribution; thus it is $$ \text{constant}\times p_j^{x_j} \prod_{i\,\in\,\mathcal I\,\smallsetminus\,\{j\}} p_i^{x_i} \, dp_i. $$ This is a Dirichlet distribution with parameters $(x_i+1 : i \in \mathcal I).$ The "constant" is $$ \frac{\Gamma\left( \sum_{i\in\mathcal I} (x_i+1) \right)}{\prod_{i\in\mathcal I} \Gamma(x_i+1)}. $$

The probability that item $k$ is chosen on the $151$st trial, given the results of the first $150$ trials, is then \begin{align} & \Pr( \text{$k$ is chosen on the 151st trial} \mid X_i= x_i \text{ for } i\in\mathcal I) \\[8pt] = {} & \operatorname E( P_k \mid X_i= x_i \text{ for } i\in\mathcal I) \\[8pt] = {} & \text{expected value of a Beta distribution,} \\ & \text{since the marginals of a Dirichlet distribution} \\ & \text{are Beta distributions} \\[8pt] = {} & \frac{x_k+1}{\sum_{i\in\mathcal I} (x_i+1)}. \end{align}
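To put numbers on the final formula, here is a minimal Python sketch that evaluates $(x_k+1)/\sum_i (x_i+1)$ for the breakdown given in the question. The names `counts` and `predictive_probability` are illustrative only. One caveat: the breakdown as posted adds up to 134 draws rather than 150, so the sketch uses the breakdown's own total; with exactly 150 draws the denominator would be $10\,150$ instead of $10\,134$. Under the uniform-prior assumption made above, an item picked $x_k$ times comes out $x_k+1$ times as likely to be picked next as an item never picked, whatever the total number of draws.

```python
from fractions import Fraction

# Breakdown copied from the question: times picked -> number of items.
# Note: this sums to 134 draws, not the 150 stated in the question.
counts = {0: 9900, 1: 80, 2: 10, 3: 6, 4: 4}


def predictive_probability(x_k, counts):
    """Posterior predictive probability (x_k + 1) / sum_i (x_i + 1),
    assuming a uniform prior over the probability vector."""
    n_items = sum(counts.values())
    n_draws = sum(x * n for x, n in counts.items())
    # sum_i (x_i + 1) = (total draws) + (number of items)
    return Fraction(x_k + 1, n_draws + n_items)


probs = {x: predictive_probability(x, counts) for x in counts}
baseline = probs[0]  # an item that has never been picked

for x, p in probs.items():
    ratio = float(p / baseline)
    print(f"picked {x} times: P(next) = {float(p):.6f} "
          f"({ratio:.0f}x a never-picked item)")
```

So, in this model, "already picked once" translates into "twice as likely as a never-picked item", i.e. 100% more likely, and each further pick adds another multiple of the baseline.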

This is actually a generalization of Laplace's problem on the probability that the sun will rise tomorrow.
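To spell out the connection (the symbols $K$, $n$ and $s$ are introduced here for illustration and are not part of the answer above): with $K$ items and $n$ trials, the formula above reads $(x_k+1)/(n+K).$ Taking $K=2$ outcomes, "success" and "failure," with $s$ successes observed in $n$ trials, it reduces to Laplace's rule of succession: $$ \Pr(\text{success on trial } n+1 \mid s \text{ successes in } n \text{ trials}) = \frac{s+1}{n+2}. $$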