How to reason about a mixed Beta prior when tossing coins?


Let's say we have just one coin. Before observing any tosses, we start with a flat prior for the probability of heads, $p$: uniform, i.e. Beta with parameters $1$ and $1$. If we then get to see a million coin tosses with roughly equal numbers of heads and tails, the prior becomes a Beta distribution highly peaked around $0.5$. Now, if someone offers to show us $10$ more tosses, we can safely pass, since our prior distribution already gives us an almost-point estimate for $p$.
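To make this concrete, here's a minimal sketch of the conjugate Beta-Binomial update, assuming (for illustration) the million tosses split exactly into 500,000 heads and 500,000 tails:

```python
import math

def beta_update(a, b, heads, tails):
    """Conjugate update: Beta(a, b) prior + coin-toss data -> Beta posterior."""
    return a + heads, b + tails

# Flat prior Beta(1, 1), then a million tosses split evenly (assumed counts).
a, b = beta_update(1, 1, 500_000, 500_000)

mean = a / (a + b)                                    # posterior mean of p
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # posterior sd of p
print(f"mean = {mean:.4f}, sd = {sd:.6f}")  # sd ~ 0.0005: nearly a point mass
```

With a standard deviation around $0.0005$, ten extra tosses cannot move this distribution in any meaningful way.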

Now, let's say there is a machine that outputs only two kinds of coins: it returns either a coin with $p=0.1$ or a coin with $p=0.9$, with equal probability. We don't know any of this, but are given free rein with the machine for $24$ hours. So we make it generate a thousand coins and toss each of them a thousand times (again generating a million tosses in total). Across these million tosses, the numbers of heads and tails are roughly equal. The next day, the machine generates a coin and we're invited to play a game that involves tossing this coin. Someone offers to toss it $10$ times before the game. Now, these $10$ tosses will help us a lot, since they'll tell us which kind of coin the machine generated.
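The machine scenario is easy to simulate. The `make_coin` helper below is hypothetical, hard-coding the $0.1/0.9$ behaviour described above:

```python
import random

random.seed(0)

def make_coin():
    """Hypothetical machine: p = 0.1 or p = 0.9 with equal probability."""
    return 0.1 if random.random() < 0.5 else 0.9

# Day one: a thousand coins, a thousand tosses each.
total_heads = 0
for _ in range(1000):
    p = make_coin()
    total_heads += sum(random.random() < p for _ in range(1000))
frac = total_heads / 1_000_000
print(frac)  # close to 0.5 in aggregate, just as in the one-coin scenario

# Day two: ten tosses of a fresh coin almost always give away its type.
p_new = make_coin()
heads = sum(random.random() < p_new for _ in range(10))
print(heads)  # typically near 0 or near 10, rarely in between
```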

It's clear that in the second scenario, the prior distribution we have isn't very helpful while in the first, it's incredibly helpful. So what is the prior distribution in the second case and how does it manage to remain unhelpful despite a million tosses?


There is 1 best solution below


I think we need to clearly spell out the entity to which the prior/posterior distributions apply. First you have a machine that produces coins with a distribution $D_m$, and next you sample a coin $C$ from it, whose tosses follow a distribution $D_C = \text{Bernoulli}(p_C)$, where $p_C$ is the probability of heads for coin $C$ (the subscript $m$ indicates the one and only machine we have, while $C = c_1, c_2, \ldots$ takes multiple values, one for each coin sampled from it). Note that both $D_m$ and $D_C$ are possibly unknown, and we may want to impose priors on them that are updated as coins are tossed.

Extreme 1: Consider an extreme version of your setup, where a million coins are drawn from the machine and each is tossed once. For a moment, assume we have an oracle that, given any coin $C$, returns its $p_C$ (probability of heads).

  • With these million tosses and the oracle, you sample $p_C$ a million times independently, and you learn $D_m$ quite well.
  • On the other hand, without the oracle, you don't learn much even from a million tosses. Each coin $C$ is tossed only once, so its $p_C$ remains largely unknown. Aggregated across the million independently sampled coins, all you learn is the mean of $p_C$ under $D_m$, and little else about $D_m$.
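The second bullet can be checked numerically: two very different machine distributions sharing the same mean produce statistically indistinguishable one-toss-per-coin data. A sketch, using the two-point $0.1/0.9$ machine as one of the examples:

```python
import random

random.seed(1)

def one_toss_fraction(p_values, n_coins):
    """Draw n_coins coins (p chosen uniformly from p_values), toss each once."""
    heads = 0
    for _ in range(n_coins):
        p = random.choice(p_values)
        heads += random.random() < p
    return heads / n_coins

# Two very different machines sharing the same mean E[p] = 0.5:
f_bimodal = one_toss_fraction([0.1, 0.9], 1_000_000)  # two-point machine
f_point   = one_toss_fraction([0.5], 1_000_000)       # degenerate machine
print(f_bimodal, f_point)
# Both fractions land near 0.5: one toss per coin only reveals the mean of D_m.
```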

Extreme 2: At the other extreme, a single coin $C=c$ is tossed a million times (your first scenario). Then we know everything about $c$, including $p_c$, but nothing about $D_m$, as we've sampled from it only once.

Middle ground: Between these extremes lies your second scenario, where a thousand coins are each tossed a thousand times; this is close to optimal for learning $D_m$. You have drawn $1{,}000$ samples (coins) from $D_m$, and for each you have a very good estimate of $p_C$.
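A quick simulation (again assuming the hypothetical $0.1/0.9$ machine) shows how cleanly this schedule recovers the shape of $D_m$:

```python
import random

random.seed(2)

# A thousand coins from the hypothetical 0.1/0.9 machine, a thousand tosses each.
p_hats = []
for _ in range(1000):
    p = random.choice([0.1, 0.9])
    heads = sum(random.random() < p for _ in range(1000))
    p_hats.append(heads / 1000)

# Each per-coin estimate has sd ~ 0.01, so the two modes separate cleanly.
low = sum(ph < 0.5 for ph in p_hats)
high = len(p_hats) - low
print(low, high)  # roughly 500 / 500: the bimodal shape of D_m is recovered
```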

Back to your question:

in the second scenario, the prior distribution we have isn't very helpful

This is correct. We learn $D_m$ very well, but $D_m$ itself is spread over $p=0.1$ and $p=0.9$, so the resulting prior for a freshly generated coin is bimodal rather than peaked; the $10$ extra tosses are what tell us which type of coin we actually received.

in the first scenario, the prior distribution is incredibly helpful

This is also correct. With a single coin we can't learn $D_m$, but we don't need to: a million tosses pin down that coin's $p_c$ almost exactly, so the prior is nearly a point mass and carries all the information we'll ever need; $10$ more tosses add essentially nothing.
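To see just how decisive those $10$ tosses are in the second scenario, here is the exact Bayes update, assuming the learned $D_m$ is the two-point prior $P(p=0.1)=P(p=0.9)=1/2$:

```python
def posterior_high(heads, n, p_lo=0.1, p_hi=0.9, prior_hi=0.5):
    """P(p = p_hi | data) under a two-point prior, via Bayes' rule.
    The binomial coefficient cancels between numerator and denominator."""
    like_hi = p_hi ** heads * (1 - p_hi) ** (n - heads)
    like_lo = p_lo ** heads * (1 - p_lo) ** (n - heads)
    return prior_hi * like_hi / (prior_hi * like_hi + (1 - prior_hi) * like_lo)

# Ten tosses are almost always decisive: each head multiplies the odds
# of the p = 0.9 coin by 9, and each tail divides them by 9.
for heads in [0, 2, 5, 8, 10]:
    print(heads, posterior_high(heads, 10))
```

Unless the tosses split almost exactly 5/5 (which is unlikely for either coin type), the posterior collapses onto one of the two modes.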

Side notes:

  • Learning $D_m$ involves nested simulation, and allocating budget optimally for each stage is a well-known problem. See for example https://dl.acm.org/doi/abs/10.5555/3522802.3522955
  • Specifying a prior on $D_m$ is non-trivial. You could restrict attention to Beta distributions (or mixtures of them), which might make it workable.