Background: From the blog, "The explore-then-commit strategy is characterized by a natural number m, which is the number of times each arm will be explored before committing. Thus the algorithm will explore for m*K rounds before choosing a single action for the remaining n-mK rounds. We can write this strategy formally as:
\begin{align*} A_t = \begin{cases} i\,, & \text{if } (t \operatorname{mod} K)+1 = i \text{ and } t \leq mK\,; \\ \text{argmax}_i \hat \mu_i(mK)\,, & t > mK\,, \end{cases} \end{align*}
where ties in the argmax are broken in a fixed arbitrary way, $\hat \mu_i(t)$ is the average pay-off for arm $i$ up to round $t$, and $T_i(t) = \sum_{s=1}^t \mathbb{1}\{A_s = i\}$ is the number of times arm $i$ has been played by round $t$:
\begin{align*} \hat \mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^t \mathbb{1}\{A_s = i\}\, X_s\,. \end{align*}
Now the regret is defined as (with $\Delta_i = \mu_1 - \mu_i$ the suboptimality gap of arm $i$, arm $1$ taken to be optimal):
\begin{align*} R_n = m \sum_{i=1}^K \Delta_i + (n-mK) \sum_{i=1}^K \Delta_i\, \mathbb P\left(i = \text{argmax}_j \hat \mu_j(mK)\right)\,. \end{align*}
"
In the next step, they upper bound the probability that arm $i$ is the argmax:
\begin{align*} \mathbb P\left(i = \text{argmax}_j \hat \mu_j(mK)\right) &\leq \mathbb P\left(\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0\right) \\ &= \mathbb P\left(\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \Delta_i\right)\,. \end{align*}
I am unable to understand this upper bound of the argmax inside the probability. Can anyone please clarify it for me?
The derivation is harder to follow because of the missing brackets. Does this help?
\begin{align*} \mathbb P\left[i = \text{argmax}_j \hat \mu_j(mK)\right] &\leq \mathbb P\left[\hat \mu_i(mK) \geq \hat \mu_1(mK) \wedge \hat \mu_i(mK) \geq \hat \mu_2(mK) \wedge \dots \wedge \hat \mu_i(mK) \geq \hat \mu_K(mK)\right] \\ &\leq \mathbb P\left[\hat \mu_i(mK) \geq \hat \mu_1(mK)\right] \\ &= \mathbb P\left[\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0\right]\,. \end{align*}
If arm $i$ is the argmax, then its empirical mean is at least every other empirical mean, in particular $\hat \mu_1(mK)$; dropping all the comparisons except the first one can only enlarge the event, which gives the middle inequality.
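To get from here to the blog's second line, subtract the constant $\mu_i - \mu_1$ from both sides inside the event and use the definition $\Delta_i = \mu_1 - \mu_i$:

```latex
\begin{align*}
\mathbb P\left[\hat \mu_i(mK) - \hat \mu_1(mK) \geq 0\right]
&= \mathbb P\left[\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \mu_1 - \mu_i\right] \\
&= \mathbb P\left[\hat \mu_i(mK) - \mu_i - \hat \mu_1(mK) + \mu_1 \geq \Delta_i\right]\,.
\end{align*}
```

The left-hand side is now a difference of centered (zero-mean) quantities, which is exactly the form needed to apply a concentration inequality such as a sub-Gaussian tail bound in the next step of the analysis.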