Our algorithm design class requires all students to enroll in an online AI competition, where each team has to come up with a bot. Before the final lockdown, each team is allowed to challenge any other team in order to test its strategy, including the random bot provided by the course assistants.
For the first testing round, each team had to play $10$ matches against a random bot provided by the course staff. By random I mean a bot that chooses a move uniformly at random from the set of moves available to it in the current game state. A draw with the random bot is worth $0$ points, a win $+1$, and a loss $-1$.
Unlike other teams, we chose to avoid hardcoding, going instead with a version of the minimax algorithm adapted to this game. Needless to say, our strategy is far from flawless, but it's a lot better than what most others came up with.
Relevant facts:
$\bullet$ We lost $5$ times and won $5$ times in the testing round. So we got $0$ points.
$\bullet$ During our practice matches we achieved an $80\%$ win rate against the random bot.
$\bullet$ Also during the practice matches we played against many other competing teams. One of the teams we faced had a very weak, hardcoded strategy; we won all $4$ matches against their bot. Another team (whose bot was also hardcoded) played us $4$ times and beat us once, while we beat them the other $3$ times. The former scored $10/10$ in the testing matches, while the latter scored $7/10$ ($3$ of their $10$ matches were draws).
$\bullet$ Neither of the $2$ teams mentioned above updated their strategy between the time they played against us and the time of the testing.
$\bullet$ We had the worst score of all the teams tested in this round, even though more than half of them were much weaker than us (as we saw in the practice matches).
$\bullet$ The rules of the game in question can be found here (MSE link).
Not much can be done about our wasted time, but I would really love to see whether there is a mathematical way to show that their grading is flawed. I'm certain the randomness factor is quite relevant here, but I don't have any training in probability theory or chaos theory, so I can't model this situation.
How would you mathematically prove the grading system is wrong?
The grading system proposed here is unfair in the sense that it is random. Suppose you have a win probability of $p$ and a loss probability of $q$ against the random bot, where $p+q\le1$. Then the probability of getting a high score in ten games increases as $p$ gets higher and/or $q$ gets lower.
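To make this concrete, here is a small sketch that computes how likely a non-positive score is in a ten-game round. The figures $p = 0.8$ and $q = 0.2$ (i.e. no draws, matching the reported $80\%$ practice win rate) are assumptions for illustration; with no draws, the score after $10$ games is $2 \cdot \text{wins} - 10$, so a score of at most $0$ means at most $5$ wins.

```python
from math import comb

# Assumed figures for illustration: win prob p = 0.8, loss prob q = 0.2,
# no draws. Wins in n games then follow a Binomial(n, p) distribution,
# and score = wins - losses = 2*wins - n.
p, q, n = 0.8, 0.2, 10

def prob_wins(k):
    """P(exactly k wins in n independent games against the random bot)."""
    return comb(n, k) * p**k * q**(n - k)

# Probability of a non-positive score, i.e. at most 5 wins out of 10:
prob_low = sum(prob_wins(k) for k in range(6))
print(f"P(score <= 0) = {prob_low:.4f}")
```

Under these assumptions the probability comes out at roughly $3\%$: small, but far from negligible, so a genuinely strong bot can still land a score of $0$ in a single ten-game round, which is exactly what happened here.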
However, even with a very high $p$ and a very low $q$, it is still possible to get a low score, resulting in an error between the actual level and the perceived level of the program. Since, I assume, every move in the game is calculated by a computer program, you could simply run the programs against each other a large number of times, say $1000$. The resulting score is still random, but by the Law of Large Numbers the fraction of games won converges to $p$ in the limit, and thus the error between the actual and perceived level converges to zero. In other words, by playing more games the probability of a large error becomes smaller and smaller, resulting in a high probability of a fair score.
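This shrinking of the error can be seen in a quick Monte Carlo sketch (again assuming a fixed win probability $p = 0.8$ per game and independence between games): the observed win fraction over $1000$ games clusters much more tightly around $p$ than the win fraction over $10$ games does.

```python
import random
import statistics

def win_fraction(p, n_games, rng):
    """Simulate n_games independent games, each won with probability p,
    and return the fraction of games won."""
    return sum(rng.random() < p for _ in range(n_games)) / n_games

rng = random.Random(42)  # fixed seed so the sketch is reproducible
p = 0.8

# Spread of the observed win fraction across 2000 repeated test rounds,
# once with 10-game rounds and once with 1000-game rounds.
short_rounds = [win_fraction(p, 10, rng) for _ in range(2000)]
long_rounds = [win_fraction(p, 1000, rng) for _ in range(2000)]

print(f"10 games per round:   sd = {statistics.stdev(short_rounds):.3f}")
print(f"1000 games per round: sd = {statistics.stdev(long_rounds):.3f}")
```

The standard deviation of the win fraction scales like $\sqrt{p(1-p)/n}$, so going from $10$ to $1000$ games shrinks it by a factor of $10$, which is why a longer testing round would grade teams far more fairly.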
A completely different argument is that the programs are scored only on their performance against a random bot. Note that a program could do badly against a random opponent yet perform very well against all other strategies. Most people would consider such a program superior to one that loses to everything except the random bot, against which it has a high winning chance.