I am interested in an exercise (specifically Exercise 7.4) from Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. It studies a phased version of the popular UCB algorithm. The algorithm takes as input the number of arms $k$. It proceeds as follows.
- Choose each arm once.
- For phases $\ell = 1, 2, \ldots$:
  - Compute the arm with the highest UCB index: $A_\ell = \operatorname{argmax}_i \textrm{UCB}_i(t-1)$.
  - Play arm $A_\ell$ exactly $2^\ell$ times.
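For concreteness, here is a minimal sketch of how I understand the algorithm. The reward model (unit-variance Gaussian) and the exact index radius $\sqrt{2\log(n)/T_i}$ are my assumptions, not taken from the exercise:

```python
import math
import random

def phased_ucb(means, horizon, seed=0):
    """Sketch of the phased UCB strategy described above.

    `means` are the true arm means (Gaussian rewards assumed for
    illustration); the index radius sqrt(2 log n / T_i) is one
    standard choice, not necessarily the book's exact constant.
    Returns the sequence of arms played.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # T_i: number of plays of arm i
    sums = [0.0] * k      # cumulative reward of arm i
    history = []          # sequence of arms played

    def pull(i):
        counts[i] += 1
        sums[i] += means[i] + rng.gauss(0.0, 1.0)  # 1-subgaussian rewards
        history.append(i)

    # Initialisation: choose each arm once.
    for i in range(k):
        pull(i)

    # Phases: commit to the current UCB leader for 2^ell rounds;
    # note the statistics are NOT updated inside a phase.
    ell = 1
    while len(history) < horizon:
        def ucb(i):
            return sums[i] / counts[i] + math.sqrt(
                2.0 * math.log(horizon) / counts[i])
        leader = max(range(k), key=ucb)
        for _ in range(2 ** ell):
            if len(history) >= horizon:
                break
            pull(leader)
        ell += 1
    return history
```

The key structural feature, which drives the difficulty below, is that the leader is recomputed only at phase boundaries, and phase lengths double.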
The goal is to compute the regret for this strategy.
The basic proof idea would be to set up a clean event on which all arm estimates are well concentrated around their means. On this event, we would ideally like to argue (as in the standard UCB proof) that each sub-optimal arm can be sampled at most a logarithmic number of times in the horizon. The issue here is that it is not clear how these samples are distributed across phases. That is, suppose a sub-optimal arm does not get sampled in the initial phases, but somehow becomes the leader in one of the later phases. Since the later phases stretch over longer durations and the arm statistics are not updated within a phase, this can lead to a large regret.
Is there a way to work around this difficulty? Any help would be much appreciated.