Number of rounds to determine the winner in a nondeterministic game


Imagine I'm organizing a competition for AI Yahtzee players (or some other game: poker, backgammon, etc.). I want to find out who plays better: player A or player B. If they play just one game, the winner could win because he plays better, or just by luck. So they need to play several games to determine the winner.

And I have two questions:

  • how many games should they play? (More is better, but how many are enough?)
  • what percentage of wins is enough to determine the winner? (For example, 46% to 54% seems like a draw, and 76% to 24% seems like a clear victory, but this is just intuition.)

I understand there is no exact answer here, but what are the general methods to use? What should I measure to arrive at an answer? (I can run any number of "warmup" games between A and B.)

BEST ANSWER

You could run statistical hypothesis tests on the share of wins of one player. But to be honest I doubt this is very practical, unless you plan to let these players play more than 20 games or so. Also, you cannot figure out "for certain" who the better player is. Assuming the outcome of these games is somewhat random, even a world class player can lose 10 times in a row to a complete amateur, although the probability of this happening is very slim (but not zero).

Assume that the outcomes of games $g=1,2,\ldots,n$ between two players $A$ and $B$ are independent (i.e., no day-to-day variation in form, etc.). Denote by $x_g\in\{0,1\}$ the outcome of game $g$ ($x_g=1$ if and only if player $A$ won).

Now the share of games won by player $A$ is the mean $s(A,B,n)=n^{-1}\sum_{g=1}^n x_g$. You can run a statistical, nonparametric test of the hypothesis that the true share $s$ equals $1/2$, i.e., that both players are equally strong (equally likely to win). To do this, compute a confidence interval for the mean $s(A,B,n)$ at, say, the conventional 95% level: first compute the standard error $SE$ of the mean; the 95% confidence interval is then $$CI=s(A,B,n)\pm 1.96\, SE.$$ The standard error is $SE=\sigma/\sqrt{n}$, where $\sigma=\sqrt{s(A,B,n)(1-s(A,B,n))}$ is the estimated standard deviation of the individual outcomes $x_g$. Clearly, the $SE$ becomes smaller as more games are played, and the confidence interval becomes narrower. This reflects the fact that with more games we are more confident that the true $s$ (which you would only obtain by letting the players play $n\to\infty$ games) is close to the sample mean $s(A,B,n)$. This is why I mentioned you might have to let these players play a lot of games before you can say, with a reasonable degree of confidence, that one is better than the other.
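The computation above can be sketched in a few lines of Python (a minimal illustration; the function name and the 6-wins-out-of-10 sample are my own choices, not from the question):

```python
import math

def win_share_ci(wins, n, z=1.96):
    """95% normal-approximation confidence interval for the true win share.

    `wins` is the number of games player A won out of `n` games played.
    """
    s = wins / n                      # sample win share s(A, B, n)
    se = math.sqrt(s * (1 - s) / n)  # standard error SE = sigma / sqrt(n)
    return s - z * se, s + z * se

lo, hi = win_share_ci(6, 10)
print(f"CI = [{lo:.3f}, {hi:.3f}]")  # → CI = [0.296, 0.904]
```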

For small $n$, the 1.96 should be replaced by the respective critical value of the t-distribution, which is larger (here I assumed $n$ is large, so the value converges to the corresponding value of the standard normal distribution).

The test is now very simple. If $1/2\notin [\underline{CI},\overline{CI}]$, then you reject the hypothesis that the players are equally strong at the 95% level. You can also more directly test the hypothesis $s>1/2$ or $s<1/2$ ($A$ better than $B$, or $B$ better than $A$).

Just to get an idea, here's an example. Suppose the players played 10 games, $n=10$, and $s(A,B,10)=0.6$. Then $[\underline{CI},\overline{CI}]=[0.296,0.904]$, so the CI is still very broad. Indeed, it requires about $n=100$ games (at the same observed share) before $\underline{CI}>1/2$, which would allow you to reject the hypothesis that the players are equally strong with 95% confidence. In short, the formulae above tell you that a player whose $s(A,B,n)$ is close to 1 or 0 for relatively large $n$ is the better one, but you knew that without all those formulae.
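Holding the observed share fixed at $s=0.6$, the smallest $n$ for which the lower bound clears $1/2$ can be found numerically (a rough sketch; it lands near the $n\approx 100$ quoted above):

```python
import math

def lower_bound(s, n, z=1.96):
    # Lower end of the 95% normal-approximation CI for the win share.
    return s - z * math.sqrt(s * (1 - s) / n)

n = 1
while lower_bound(0.6, n) <= 0.5:
    n += 1
print(n)  # → 93, i.e. on the order of 100 games
```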

My advice: just stick to a rule of thumb: whoever has more wins after 10 games is better. Statistically, that player is more likely to be the better player, even though with so few games we cannot be very confident about it (see the example).