Upper bound for Explore-then-commit Bandit algorithm


Background: From the blog, "The explore-then-commit strategy is characterized by a natural number $m$, which is the number of times each arm will be explored before committing. Thus the algorithm will explore for $mK$ rounds before choosing a single action for the remaining $n - mK$ rounds. We can write this strategy formally as:

\begin{align*} A_t = \begin{cases} i\,, & \text{if } (t \operatorname{mod} K)+1 = i \text{ and } t \leq mK\,; \\ \text{argmax}_i \hat \mu_i(mK)\,, & t > mK\,, \end{cases} \end{align*}
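The action rule above can be sketched in code. This is a minimal illustration, not from the post: the function name, the 0-indexed rounds (so exploration plays arm $t \bmod K$ rather than $(t \bmod K)+1$), and the pre-drawn reward table are all assumptions.

```python
import numpy as np

def etc_actions(rewards, m, K):
    """Explore-then-commit against a pre-drawn reward table of shape (n, K).

    Explores each of the K arms m times in round-robin order, then commits
    to the arm with the highest empirical mean (ties broken by lowest index,
    a fixed arbitrary rule as in the post). Returns the list of actions.
    """
    n = rewards.shape[0]
    actions = []
    sums = np.zeros(K)    # running reward sums per arm
    counts = np.zeros(K)  # running pull counts per arm, T_i(t)
    committed = None
    for t in range(n):
        if t < m * K:
            a = t % K  # round-robin exploration
        else:
            if committed is None:
                # commit to argmax_i mu_hat_i(mK); each arm has m samples here
                committed = int(np.argmax(sums / counts))
            a = committed
        actions.append(a)
        sums[a] += rewards[t, a]
        counts[a] += 1
    return actions
```

For example, with two arms whose rewards are deterministically 0 and 1, the strategy explores each arm $m$ times and then commits to arm 1 (0-indexed) for the rest of the horizon.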

where ties in the argmax will be broken in a fixed arbitrary way and $\hat \mu_i(t) $ is the average pay-off for arm i up to round t:

\begin{align*} \hat \mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^t \mathbb{1}\{A_s = i\} X_s\,. \end{align*}

Now the regret is defined as:

\begin{align*} R_n = m \sum_{i=1}^K \Delta_i + (n-mK) \sum_{i=1}^K \Delta_i\, \mathbb P\big[i = \text{argmax}_j \hat \mu_j(mK)\big]\,, \end{align*}

"

In the next step, they upper bound the probability of the argmax event as:

\begin{align*} \mathbb P\big[i = \text{argmax}_j \hat \mu_j(mK)\big] &\leq \mathbb P\big[\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0\big] \\ &= \mathbb P\big[\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \Delta_i\big]\,. \end{align*}
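The first inequality can be sanity-checked numerically. Below is a Monte Carlo sketch under assumed Gaussian arms (the means, sample sizes, and arm index are arbitrary choices for illustration, not from the post); arm 1 is taken as the optimal arm, 0-indexed as arm 0.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5, 0.3])  # mu[0] is mu_1, the optimal arm (assumed)
K, m, trials = len(mu), 5, 100_000
i = 2                           # a suboptimal arm under study (0-indexed)

# Draw m unit-variance Gaussian samples per arm for each trial and form
# the empirical means mu_hat(mK).
samples = rng.normal(mu, 1.0, size=(trials, m, K))
mu_hat = samples.mean(axis=1)   # shape (trials, K)

# Estimate both sides of the inequality.
p_argmax = np.mean(np.argmax(mu_hat, axis=1) == i)
p_beats_best = np.mean(mu_hat[:, i] - mu_hat[:, 0] >= 0)
print(p_argmax, p_beats_best)
```

The event $\{i = \text{argmax}_j \hat\mu_j\}$ implies $\{\hat\mu_i \geq \hat\mu_1\}$ trial by trial, so the first estimate can never exceed the second.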

I am unable to understand this upper bound on the probability of the argmax event. Can anyone please clarify it for me?

1 Answer
The derivation is harder to follow because of the missing brackets. Does this help?

\begin{align*} \mathbb P\big[i = \text{argmax}_j \hat \mu_j(mK)\big] &= \mathbb P\big[\hat \mu_i(mK) \geq \hat \mu_1(mK) \wedge \hat \mu_i(mK) \geq \hat \mu_2(mK) \wedge \dots \wedge \hat \mu_i(mK) \geq \hat \mu_K(mK)\big] \\ &\leq \mathbb P\big[\hat \mu_i(mK) \geq \hat \mu_1(mK)\big] \\ &= \mathbb P\big[\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0\big]\,. \end{align*}
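To connect this back to the second line quoted in the question, one further step (a sketch, using that $\Delta_i = \mu_1 - \mu_i$ is a constant) rewrites the event by subtracting $\mu_i - \mu_1$ from both sides:

\begin{align*} \mathbb P\big[\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0\big] &= \mathbb P\big[\hat \mu_i(mK) - \hat \mu_1(mK) - (\mu_i - \mu_1) \geq \mu_1 - \mu_i\big] \\ &= \mathbb P\big[\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \Delta_i\big]\,. \end{align*}

This is why the centered form with $\Delta_i$ on the right-hand side is an equality, not a further bound.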