I am interested in an exercise (specifically Exercise 7.4) from Bandit Algorithms by Tor Lattimore and Csaba Szepesvari. It studies a phased version of the popular UCB algorithm. The algorithm takes as input the number of arms $k$. It proceeds as follows.
- Choose each arm once.
- For phases $\ell = 1, 2, \ldots$:
  - Compute the arm with the highest UCB index: $A_\ell = \operatorname{argmax}_i \textrm{UCB}_i(t-1)$.
  - Play arm $A_\ell$ exactly $2^\ell$ times.
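For concreteness, here is a minimal sketch of how I understand the algorithm. The reward model (unit-variance Gaussian) and the exact index radius $\sqrt{2\log(n)/T_i}$ are my assumptions, not taken from the exercise:

```python
import math
import random

def phased_ucb(means, horizon, seed=0):
    """Sketch of the phased UCB strategy described above.

    `means` are the true arm means (Gaussian rewards assumed for
    illustration); the index radius sqrt(2 log n / T_i) is one
    standard choice, not necessarily the book's exact constant.
    Returns the sequence of arms played.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # T_i: number of plays of arm i
    sums = [0.0] * k      # cumulative reward of arm i
    history = []          # sequence of arms played

    def pull(i):
        counts[i] += 1
        sums[i] += means[i] + rng.gauss(0.0, 1.0)  # 1-subgaussian rewards
        history.append(i)

    # Initialisation: choose each arm once.
    for i in range(k):
        pull(i)

    # Phases: commit to the current UCB leader for 2^ell rounds;
    # note the statistics are NOT updated inside a phase.
    ell = 1
    while len(history) < horizon:
        def ucb(i):
            return sums[i] / counts[i] + math.sqrt(
                2.0 * math.log(horizon) / counts[i])
        leader = max(range(k), key=ucb)
        for _ in range(2 ** ell):
            if len(history) >= horizon:
                break
            pull(leader)
        ell += 1
    return history
```

The key structural feature, which drives the difficulty below, is that the leader is recomputed only at phase boundaries, and phase lengths double.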
The goal is to compute the regret for this strategy.
The basic proof idea would be to set up a clean event on which all arm estimates are well concentrated around their means. On this event, we would ideally like to argue (as in the standard UCB proof) that each sub-optimal arm can be sampled at most a logarithmic number of times in the horizon. The issue here is that it is not clear how these samples are distributed across phases. That is, suppose a sub-optimal arm does not get sampled in the initial phases, but somehow becomes the leader in one of the later phases. Since the later phases stretch over longer durations and the arm statistics are not updated within a phase, this can lead to a large regret.
Is there a way to work around this difficulty? Any help would be much appreciated.