$UCB-\alpha$ policy for multi-armed bandit - conditions on UCB indices for picking suboptimal arm

45 Views Asked by At

While reading the optimality proof for the $UCB-\alpha$ policy for the multi-armed bandit problem , I came across a claim which I couldn't understand the logic of.

Notations:

$I_{i}(t) = \hat{\mu}_{i}(t)+\sqrt{\frac{\alpha \log t}{\tau_{i}(t)}}$ - the UCB index for suboptimal arm $i$.

$I_{\ast}(t) = \hat{\mu}_{\ast}(t)+\sqrt{\frac{\alpha \log t}{\tau_{\ast}(t)}}$ - the UCB index of the optimal arm.

$\Delta_{i} \triangleq \mu_{*}-\mu_{i}$

The claim:

" To have $I_{i}(t) \geq I_{*}(t)$, at least one of the following must hold: $$ \begin{aligned} \Delta_{i} & <2 \sqrt{\alpha \frac{\log t}{\tau_{i}(t)}}, \\ I_{*}(t) & \leq \mu_{*}, \\ I_{i}(t) & \geq \mu_{i}+2 \sqrt{\alpha \frac{\log t}{\tau_{i}(t)}} . \end{aligned} $$

This can be seen by noticing that if none of the above three inequalities is true, then the gap $\Delta_{i}$ between $\mu_{*}$ and $\mu_{i}$ is at least $2 \sqrt{\alpha \frac{\log t}{\tau_{i}(t)}}$ and the index $I_{*}(t)$ of the optimal arm exceeds $\mu_{*}$ while the index $I_{i}(t)$ of arm $i$ does not make up the minimum gap between $\mu_{*}$ and $\mu_{i}$, leading to $I_{i}(t)<I_{*}(t)$

"

I tried plugging in the definitions for the UCB indices to any of the conditions, but couldn't see how it helps proving that the index of the suboptimal arm is larger than the index of the optimal arm. I would appreciate any further explanation!

The article:

Finite-time analysis of the multi-armed bandit problem (Cesa-Bianchi, and Fischer, 2002)