I have two programs for playing a two-player, zero-sum, perfect-information game.
The game has a very high "branching factor".
No luck is involved, but game results are chaotic due to a rather large number of starting states, so when two programs play, the better program may win only a modestly larger share of the games.
My question is how many games must I let them play, using random starting states, before I am $95\%$ confident that I have identified the better program?
This might be simple, but my statistics course was back in the 70s ;)
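To make the "how many games" question concrete, here is a rough sketch of the usual normal-approximation sample-size formula for a one-sided test of "win probability is $0.5$" against a specific alternative. It assumes things the question does not state: games are independent, there are no draws, the better program's true win probability `p_true` is known or guessed in advance, and a power of $80\%$ is acceptable. The function name `games_needed` is just illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def games_needed(p_true, alpha=0.05, power=0.80):
    """Approximate number of games for a one-sided test of p = 0.5.

    p_true is an ASSUMED true win probability (> 0.5) of the better program.
    alpha is the significance level (0.05 corresponds to "95% confident").
    power is the ASSUMED chance of detecting the edge if it really exists.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    # Standard deviation of one game is 0.5 under the null hypothesis
    # and sqrt(p(1-p)) under the alternative.
    numerator = z_alpha * 0.5 + z_beta * sqrt(p_true * (1 - p_true))
    return ceil((numerator / (p_true - 0.5)) ** 2)


# The required number of games grows quickly as the edge shrinks.
for p in (0.51, 0.55, 0.60):
    print(p, games_needed(p))
```

The point of the sketch is that there is no single answer: the number of games scales roughly with $1/(p - 0.5)^2$, so halving the edge roughly quadruples the games required.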
Alternate form of question: $X$ wins $255$ games and $Y$ only $245$.
How certain am I that $X$ is the better player?
Looks like the "Sign Test" is what I need, and there's an online version at http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html
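The sign test mentioned above can also be run locally. The sketch below computes the exact binomial tail for the $255$ vs. $245$ example, assuming independent games, no draws, and a null hypothesis that each program wins with probability $0.5$. The helper name `sign_test_p_value` is made up for this example.

```python
from math import comb


def sign_test_p_value(wins_x, wins_y):
    """Exact sign test: return (one_sided, two_sided) p-values.

    Under the null hypothesis both programs are equally strong, so the
    larger win count follows a Binomial(n, 0.5) distribution.
    """
    n = wins_x + wins_y
    k = max(wins_x, wins_y)
    # Number of outcomes with at least k wins out of n, out of 2**n total.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    one_sided = tail / 2 ** n
    two_sided = min(1.0, 2 * one_sided)
    return one_sided, two_sided


one_sided, two_sided = sign_test_p_value(255, 245)
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```

Note that the p-value is the probability of seeing a lead at least this large if the programs were equally strong. It is not, by itself, the probability that $X$ is the better player.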