I have a game with a random element (Backgammon), and I have developed two different agents that can play it. I now match these two agents against each other in N games. My hypothesis is that agent A plays better than agent B, so I expect the results to be biased towards A. How can I prove (or at least gain some confidence) that agent A is stronger than agent B, based on the outcome of the N games?
I assume each game ends in either a win or a loss.
This feels like a really simple question, but I think I am doing it incorrectly.
Here is what I do: I have played $N = 11000010$ games between the two agents. Agent A won 5528841 of them, so agent B won the remaining 5471169. I then calculate $\hat\mu = \frac{5528841}{11000010} = 0.5026214521623162$.
Then I calculate standard deviation of the experiment $\sigma = N \hat\mu (1-\hat\mu)$. (?) And then $\frac{\hat\mu - 0.5}{\sigma/\sqrt{N}}$ which gives me a really small number that I really don't trust.
The null hypothesis $H_0$ is that the $N$ games are independent Bernoulli trials in which $B$'s probability of winning is at least $1/2$. Thus the number $X$ of games $B$ wins is a binomial random variable with parameters $N$ and $p$, with $p \ge 1/2$. Then $$\mathbb P(X \le x \mid H_0) \le \sum_{k=0}^x {N \choose k} 2^{-N}.$$ If $N$ is very large, you might wish to approximate this by a normal distribution, but for reasonably sized $N$ it is not hard to compute directly.
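To make the bound concrete, here is a minimal Python sketch, not a definitive implementation. The function names `exact_lower_tail` and `normal_lower_tail` are made up for illustration. The first evaluates the sum $\sum_{k=0}^x {N \choose k} 2^{-N}$ with exact integer arithmetic, which is practical only for moderate $N$. The second uses the normal approximation with mean $N/2$ and standard deviation $\sqrt{N}/2$, assuming no continuity correction, and is applied to the counts from the question.

```python
import math


def exact_lower_tail(n: int, x: int) -> float:
    """P(X <= x) for X ~ Binomial(n, 1/2), summed with exact integers."""
    total = sum(math.comb(n, k) for k in range(x + 1))
    return total / 2**n


def normal_lower_tail(n: int, x: int) -> float:
    """Normal approximation to P(X <= x) for X ~ Binomial(n, 1/2).

    Assumption: no continuity correction is applied.
    """
    z = (x - n / 2) / (math.sqrt(n) / 2)
    # Standard normal CDF written via the complementary error function,
    # which stays accurate far out in the tail.
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Counts taken from the question: B's wins out of all games played.
n_games = 11_000_010
b_wins = 5_471_169

z = (b_wins - n_games / 2) / (math.sqrt(n_games) / 2)
p_approx = normal_lower_tail(n_games, b_wins)
print(f"z = {z:.3f}, approximate tail probability = {p_approx:.3e}")

# For a moderate N, the exact sum and the approximation can be compared.
print(exact_lower_tail(100, 40), normal_lower_tail(100, 40))
```

With the question's counts, $z$ is roughly $-17.4$, so the approximate tail probability is extremely small. A tiny value is therefore the expected result for this data rather than a sign of a mistake.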
If you want a test at significance level $\alpha$, choose $N$ and $x$ in advance such that $\sum_{k=0}^x {N \choose k} 2^{-N} \le \alpha$. You then play the $N$ games, and reject the null hypothesis if you observe $X \le x$.
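The choice of $x$ for a given $N$ and $\alpha$ can be sketched as a simple search. This is an illustrative sketch only: `critical_value` is a hypothetical helper name, and it returns the largest $x$ whose tail sum stays at or below $\alpha$, or `None` when even $x = 0$ exceeds it. It uses exact integer sums, so it is meant for moderate $N$.

```python
import math
from typing import Optional


def critical_value(n: int, alpha: float) -> Optional[int]:
    """Largest x with sum_{k<=x} C(n, k) / 2^n <= alpha, or None if none exists."""
    denom = 2**n
    cumulative = 0
    best = None
    for k in range(n + 1):
        cumulative += math.comb(n, k)
        if cumulative / denom > alpha:
            break  # the tail sum only grows with k, so stop here
        best = k
    return best


def reject_null(n: int, alpha: float, b_wins: int) -> bool:
    """Reject H0 (B wins with probability >= 1/2) when B's win count is <= x."""
    x = critical_value(n, alpha)
    return x is not None and b_wins <= x


# Example with illustrative numbers: 100 games at the 5% level.
x = critical_value(100, 0.05)
print("critical value:", x)
print("reject with 38 wins for B:", reject_null(100, 0.05, 38))
```

Fixing $x$ before the games are played is what keeps the probability of a false rejection at or below $\alpha$.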