I would like to write a Nash equilibrium solver for a simplified Poker game. I was reading that the way some Nash equilibrium solvers for Poker work is that they iterate until convergence:
For a given strategy $S$ they find the best-response strategy $S_2 = Br(S)$ (also known in Poker as the MES, the Maximally Exploitative Strategy) to counter $S$. Then they compute $S_3 = Br(S_2)$. The algorithm keeps iterating until the Nash-Distance between the last two strategies is below some predefined threshold, i.e.
$|EV(S_{n+1}, S_n) - EV(Br(S_n), S_n)| < \epsilon$
and
$|EV(S_{n+1}, S_n) - EV(S_{n+1}, Br(S_{n+1}))| < \epsilon$
These conditions mean that neither strategy can be exploited by a best response much more than it already is by the opponent's actual strategy, and thus such a pair of strategies is close to a Nash-EQ.
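To make the stopping condition concrete, here is a minimal sketch of an exploitability check for a matrix game (the payoff matrix, rock-paper-scissors as the example, and the function names are my own; this is just one way to implement such a check):

```python
import numpy as np

# Rock-paper-scissors payoff matrix for player 1 (rows: P1's action, cols: P2's action).
# Action order: Rock, Paper, Scissors.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

def ev(s1, s2):
    """Player 1's expected value when P1 plays mixed strategy s1 and P2 plays s2."""
    return s1 @ A @ s2

def exploitability(s1, s2):
    """How much each player could gain by deviating to a best response."""
    gain1 = max(A @ s2) - ev(s1, s2)   # P1's gain from switching to Br(s2)
    gain2 = ev(s1, s2) - min(s1 @ A)   # P2's gain from switching to Br(s1)
    return gain1, gain2

uniform = np.full(3, 1/3)
print(exploitability(uniform, uniform))   # both gains are 0: uniform vs uniform is the Nash-EQ

rock = np.array([1.0, 0.0, 0.0])
print(exploitability(rock, uniform))      # P2's gain is 1: switching to Paper exploits Rock
```

The pair is declared (approximately) converged when both gains fall below $\epsilon$.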
a) Do I understand this right?
b) How to prevent an algorithm from being stuck in an infinite loop?
Suppose we want an algorithm to find a Nash-EQ for the rock-paper-scissors game.
We choose the initial strategy of Player 1 to show rock all the time: $S = \{\text{Rock } 100\%\}$
The best response for Player 2 is then $S_2 = Br(S) = \{\text{Paper } 100\%\}$
The best response for Player 1 is then $S_3 = Br(S_2) = \{\text{Scissors } 100\%\}$
The best response for Player 2 then becomes $S_4 = Br(S_3) = \{\text{Rock } 100\%\}$.
As you can see, after 3 iterations we're back to the initial strategy. How can I prevent the algorithm from getting stuck in a loop like this?
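The cycle is easy to reproduce in code. A minimal sketch (the payoff matrix and the `best_response` helper are my own naming):

```python
import numpy as np

# Rock-paper-scissors payoffs for the row player (order: Rock, Paper, Scissors).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
ACTIONS = ["Rock", "Paper", "Scissors"]

def best_response(sigma):
    """Pure best response to a mixed strategy sigma. The game is symmetric
    (A is skew-symmetric), so the same function works for both players."""
    br = np.zeros(3)
    br[np.argmax(A @ sigma)] = 1.0
    return br

s = np.array([1.0, 0.0, 0.0])          # S = {Rock 100%}
for i in range(6):
    s = best_response(s)
    print(ACTIONS[int(np.argmax(s))])  # prints Paper, Scissors, Rock, Paper, Scissors, Rock
```

Naive best-response iteration never settles: it orbits the equilibrium forever instead of approaching it.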
Didn't watch the video, but you're right that simply doing self-play by best-responding* to your opponent's current strategy can cycle/not converge to a Nash equilibrium, even in 2-player zero-sum games.
The methods that work by doing self-play like this fix the problem by best-responding to the opponent's time-averaged strategy instead of their current strategy; the players' time-averaged strategies then converge to a Nash equilibrium (in 2-player zero-sum games)**. For example, Fictitious Play.
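A minimal fictitious-play sketch for rock-paper-scissors (the names are mine; the key point is that each player best-responds to the opponent's *empirical average* strategy, which is what breaks the cycle):

```python
import numpy as np

A = np.array([[ 0, -1,  1],   # RPS payoffs for the row player
              [ 1,  0, -1],
              [-1,  1,  0]])

def best_response(sigma):
    """Pure best response to a mixed strategy (symmetric game, so shared by both players)."""
    br = np.zeros(3)
    br[np.argmax(A @ sigma)] = 1.0
    return br

# Action counts double as unnormalized time-averaged strategies.
counts1 = np.array([1.0, 0.0, 0.0])   # P1 starts with Rock
counts2 = np.array([0.0, 1.0, 0.0])   # P2 starts with Paper
for t in range(50000):
    counts1 += best_response(counts2 / counts2.sum())
    counts2 += best_response(counts1 / counts1.sum())

print(counts1 / counts1.sum())  # approaches [1/3, 1/3, 1/3], the RPS Nash equilibrium
```

The current play still cycles through pure strategies, but the *averages* drift toward the equilibrium (convergence is slow, which is one reason regret-based methods below took over in practice).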
A different (perhaps more accurate?) way of looking at the intuition for the algorithms that underlie a lot of modern poker solving is through the lens of regret minimization (which I think first came out of the field of adversarial bandits). Specifically, a lot of the state of the art in the past decade has used counterfactual regret minimization (CFR), which is a sequential-game generalization of a no-regret bandit algorithm called regret matching.
The quick intuition is something like this: if both players play according to one of these algorithms, their total regret grows sublinearly (i.e. the average regret per round goes to 0). Then, it can be shown that if average regret goes to 0, the empirical distribution of the players' play forms an $\epsilon$-coarse correlated equilibrium, which in two-player zero-sum games means that the players' average strategies are an $\epsilon$-Nash equilibrium.
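As an illustration, here is a minimal regret-matching sketch for rock-paper-scissors (this is the plain normal-form version, not full CFR over a game tree; the names are mine):

```python
import numpy as np

A = np.array([[ 0, -1,  1],   # RPS payoffs for player 1 (zero-sum game)
              [ 1,  0, -1],
              [-1,  1,  0]])

def rm_strategy(regrets):
    """Regret matching: play each action with probability proportional to its
    positive cumulative regret; play uniformly if no regret is positive."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1/3)

regrets1, regrets2 = np.zeros(3), np.zeros(3)
sum1, sum2 = np.zeros(3), np.zeros(3)        # strategy sums for averaging
for t in range(20000):
    s1, s2 = rm_strategy(regrets1), rm_strategy(regrets2)
    sum1 += s1
    sum2 += s2
    u1 = A @ s2                # P1's EV for each pure action vs s2
    u2 = -(A.T @ s1)           # P2's EV for each pure action vs s1 (zero-sum)
    regrets1 += u1 - s1 @ u1   # per-action regret vs what we actually earned
    regrets2 += u2 - s2 @ u2

print(sum1 / sum1.sum())   # average strategy converges to [1/3, 1/3, 1/3]
```

Note it is the *average* strategy `sum1 / sum1.sum()` that converges to equilibrium, not the current iterate `s1`; the same caveat applies to CFR.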
* "Best Response" is the usual terminology for a maximally exploitative strategy
** You should also be able to fix this by adding regularization to the players' policies