Magical event in the proof of lower bound on multi-armed bandits


We have a two-armed bandit with reward means $\mu_1, \mu_2 \in [0,1]$. Without loss of generality, assume that $\mu_1 > \mu_2$ (the algorithm does not know this). At every time instant $t$ the algorithm has to pick one arm $a_t \in \{1, 2\}$ to play; since it does not know that arm 1 is the better arm, it may pick the suboptimal one. The regret of the algorithm up to time $T$ is defined as $R_T = \sum_{t =1}^T (\mu_1 - \mu_2) \mathcal{I}\{a_t = 2\}$, where $\mathcal{I}$ is the indicator function.
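To make the setup concrete, here is a minimal simulation sketch of the regret definition above. The policy (plain greedy on empirical means) and the helper name `run_bandit` are my own illustrative choices, not anything from the survey; the point is only that $R_T$ is $(\mu_1 - \mu_2)$ times the number of pulls of arm 2:

```python
import random

def run_bandit(mu1, mu2, T, seed=0):
    """Play a two-armed Bernoulli bandit with a naive greedy policy
    (illustrative only, not the survey's algorithm) and return the
    regret R_T = sum_t (mu1 - mu2) * I{a_t = 2} together with T_2(T)."""
    rng = random.Random(seed)
    counts = [0, 0]    # number of pulls of each arm
    sums = [0.0, 0.0]  # total reward collected from each arm
    regret = 0.0
    for t in range(T):
        if counts[0] == 0:
            a = 0  # pull each arm once before going greedy
        elif counts[1] == 0:
            a = 1
        else:
            # greedy on empirical means, coin flip on ties
            m0, m1 = sums[0] / counts[0], sums[1] / counts[1]
            a = 0 if m0 > m1 else 1 if m1 > m0 else rng.randrange(2)
        reward = 1.0 if rng.random() < (mu1 if a == 0 else mu2) else 0.0
        counts[a] += 1
        sums[a] += reward
        regret += (mu1 - mu2) * (a == 1)  # indicator that arm 2 was played
    return regret, counts[1]  # counts[1] is T_2(T)
```

Note that plain greedy can lock onto the wrong arm forever, which is exactly why the lower-bound proof cares about how often any consistent algorithm must play arm 2.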

The proof of the lower bound on the regret of a two-armed bandit given in Section 2.3 of this survey introduces this magical event:

\begin{equation} C_n = \left\{ T_2(n) < \frac{1- \epsilon}{kl(\mu_2, \mu_2^\prime)} \ln n \ \text{ and } \ \hat{kl}_{T_2(n)} \leq \left(1 - \frac{\epsilon}{2}\right) \ln n \right\}, \end{equation}

where $T_2(n)$ is the number of times arm 2 was played up to time $n$, $\epsilon > 0$, $\hat{kl}_s$ is the empirical estimate of $kl(\mu_2, \mu_2^\prime)$ based on the first $s$ rewards from arm 2, $\mu_2^\prime$ is an alternative mean for arm 2 (chosen so that arm 2 becomes the optimal arm in the perturbed instance), and $kl(p, q) = p \ln \frac{p}{q} + (1-p) \ln \frac{1-p}{1-q}$. I have read this (or a similar) proof in a number of papers, but none of them explain how the event $C_n$ was arrived at. Could someone explain how this event was thought of?
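For concreteness, if I am reading the survey's notation correctly, in the Bernoulli case $\hat{kl}_s$ is the empirical log-likelihood ratio of the first $s$ rewards $X_{2,1}, \ldots, X_{2,s}$ of arm 2 under mean $\mu_2$ versus mean $\mu_2^\prime$:

\begin{equation} \hat{kl}_s = \sum_{t=1}^{s} \ln \frac{\mu_2 X_{2,t} + (1-\mu_2)(1-X_{2,t})}{\mu_2^\prime X_{2,t} + (1-\mu_2^\prime)(1-X_{2,t})}, \end{equation}

so that $\mathbb{E}[\hat{kl}_s] = s \, kl(\mu_2, \mu_2^\prime)$ when the rewards are drawn with the true mean $\mu_2$.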