I am reading a book about Bandit Algorithms in which the authors make the following observation. I am wondering if anyone could point me to an explanation of why the first-mover can always be exploited unless following a randomized strategy:
Readers familiar with game theory will not be surprised by the need for randomisation. The interaction between learner and adversarial bandit can be framed as a two-player zero-sum game between the learner and environment. The moves for the environment are the possible reward sequences, and for the player they are the policies. The pay-off for the environment/learner is the regret and its negation respectively. Since the player goes first, the only way to avoid being exploited is to choose a randomised policy.
The problem is that the idea of a zero-sum game is being conflated with a dynamic or sequential move game.
In a zero-sum game with a finite number of strategies, the players move simultaneously. But if my payoff is the negative of your payoff and vice versa, you can think of me as moving to minimize your payoff given that you are maximizing. Conversely, since $-\max = \min-$, you can think of me as maximizing my payoff subject to you acting to minimize my payoff. Since the game is zero sum, we must be randomizing (or if there are many equilibria, like the game in which all our payoffs are zero, there is one that involves randomization). Think of it like rock-paper-scissors: if I strictly prefer some strategy, that means you must be losing out, and you can make me worse off and you better off by adjusting your strategy. That must mean we are all indifferent over our pure strategies, given our expectations of how our opponents will behave. That means I pick my mixed strategy to make you indifferent over your pure strategies, and vice versa. This is usually called the minimax theorem: https://mathworld.wolfram.com/MinimaxTheorem.html .
All of that has no "timing". It is a simultaneous move game, and neither player moves first or second. It is only because the operators in the minimax theorem appear in "order" that you might get that impression. In games that genuinely have timing, the correct solution concept is Subgame Perfect Nash Equilibrium. In an SPNE, randomization only happens when players are indifferent over some of their payoffs at terminal histories in the game, and this is a fragile and somewhat rare situation (if you perturb the payoffs a very small amount, such indifferences are broken, and there will typically be a pure strategy SPNE). In your example, in the $\min \max$, the second mover does not get to observe the outcome of the first mover's randomization before picking their strategy, only the randomization itself (or else the second mover could always pick an optimal response to the first mover, which they typically do not; they simply know the distribution of states the first mover picked). To keep the first mover uncertain and maximize their payoff, the second-mover therefore plays randomly, just as in rock-paper-scissors.