I'm following this note to learn how to derive a regret upper bound for the UCB algorithm on the Stochastic Multi-Armed Bandit Problem. In particular, the proof of Lemma 15.6 there asserts that we can apply Hoeffding's inequality to $$ Pr\left(\frac{1}{N_{t,a}}\sum_{s=1}^t Z_{s,a} I_{A_s=a} - \mu_a \geq \epsilon \right) $$ where the $Z_{s,a}$ are iid with mean $\mu_a$ and are independent of the other arms' rewards, $A_s$ is the action taken at time $s$, which is in general a function of the past rewards and past actions, and $N_{t,a}=\sum_{s=1}^t I_{A_s=a}$.
Of course, a natural starting point is to condition on $N_{t,a}=n$, so that inside the probability there are $n$ iid random variables. However, I cannot proceed because the $Z$'s are clearly not independent of $N_{t,a}$: $N_{t,a}$ depends on $(A_s)_{s=1,\dots,t}$, and each $A_s$ depends on $(Z_{i,a})_{i=1,\dots,s-1,\, a\in\mathcal{A}}$; in particular, UCB specifies $A_{s+1}= \operatorname{argmax}_a U_{s,a}$, where $U_{s,a}:= \frac{1}{N_{s,a}}\sum_{i=1}^s Z_{i,a} I_{A_i=a} + \sqrt{\frac{\log(s)}{2N_{s,a}}}$.
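For concreteness, here is a minimal Python sketch of the UCB rule above. The reward samplers and horizon below are my own illustrative choices, not anything from the notes:

```python
import math
import random

def ucb(arms, horizon, seed=0):
    """Minimal UCB sketch. `arms` is a list of reward samplers:
    callables taking an rng and returning a reward in [0, 1]."""
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k      # N_{t,a}: number of pulls of arm a so far
    sums = [0.0] * k      # running sum of rewards observed from arm a
    history = []
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1     # initialization: pull each arm once
        else:
            # A_t = argmax_a  empirical mean  +  sqrt(log(t-1) / (2 N_{t-1,a}))
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(math.log(t - 1) / (2 * counts[i])))
        r = arms[a](rng)
        counts[a] += 1
        sums[a] += r
        history.append((a, r))
    return counts, sums, history
```

For example, `ucb([lambda rng: float(rng.random() < 0.9), lambda rng: float(rng.random() < 0.5)], 1000)` concentrates almost all pulls on the first arm.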
Similarly, the online notes here (page 2, Lemma 1) and here (page 8, Theorem 4) both directly claim that Hoeffding's inequality applies to the probability above, without justification.
I'm aware that there are other methods to derive an upper bound for UCB, but I'd really like to confirm if this approach is a dead end.
I believe I have a simple counterexample to the approach of conditioning on $N$. Consider the UCB algorithm for $k=2$ (an action space with two elements). The algorithm begins by pulling each arm once: at $t=1$ we choose $A_1=1$ and sample $Z_{t=1,a=1}$; at $t=2$ we choose $A_2=2$ and sample $Z_{t=2,a=2}$.
Then $t=3$ gives the counterexample: we set $A_3=\operatorname{argmax}_a \{U_{t=2,a=1}, U_{t=2,a=2}\}$, and since both arms have been pulled exactly once, the exploration bonuses are equal, so this amounts to choosing $A_3=1$ if $Z_{1,1}>Z_{2,2}$ and $A_3=2$ otherwise; we then sample $Z_{3,A_3}$. Consider the following:
$$ P( \frac{1}{2} (Z_{t=1,a=1} + Z_{t=3,a=1}) - \mu_1 \geq \epsilon | N_{3,1}=2) = P( \frac{1}{2} (Z_{1,1} + Z_{3,1}) - \mu_1 \geq \epsilon | Z_{1,1}>Z_{2,2}) $$
This shows that the $Z$'s and $N_{t,a}$ cannot be independent in general; moreover, conditioning on $N_{t,a}$ can alter the distribution of the $Z$'s, e.g. the conditioned $Z$'s need no longer have mean $\mu_a$.
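This bias is easy to verify numerically. A small Monte Carlo sketch, assuming Uniform$[0,1]$ rewards (my own illustrative choice, so $\mu_1=1/2$), estimates $E[(Z_{1,1}+Z_{3,1})/2 \mid Z_{1,1}>Z_{2,2}]$; since $E[Z_{1,1}\mid Z_{1,1}>Z_{2,2}]=2/3$ while $Z_{3,1}$ is unaffected, the conditional mean should come out near $7/12$, not $1/2$:

```python
import random

def conditional_mean(trials=200_000, seed=0):
    """Estimate E[(Z_{1,1} + Z_{3,1}) / 2 | Z_{1,1} > Z_{2,2}] for
    Uniform[0,1] rewards (illustrative choice; mu_1 = 0.5)."""
    rng = random.Random(seed)
    total, hits = 0.0, 0
    for _ in range(trials):
        z11, z22 = rng.random(), rng.random()
        if z11 > z22:              # the event {N_{3,1} = 2} under UCB
            z31 = rng.random()     # second pull of arm 1, at t = 3
            total += (z11 + z31) / 2
            hits += 1
    return total / hits            # ≈ 7/12 ≈ 0.583, not mu_1 = 0.5
```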
Hence, I believe this is an error in the notes. To still obtain an overall bound on the regret, we need a different approach altogether, e.g.
https://tor-lattimore.com/downloads/book/book.pdf (page 104, theorem 7.1)
https://dash.harvard.edu/bitstream/handle/1/37364663/ZHAO-SENIORTHESIS-2020.pdf?sequence=1&isAllowed=y (page 22, theorem 2.7)
Thanks to @JohnPrice's pointer, it is possible to show that $$ Pr\left(\frac{1}{N_{t,a}}\sum_{s=1}^t Z_{s,a} I_{A_s=a} - \mu_a \geq \sqrt{\frac{\log(1/\delta)k}{2N_{t,a}^2}} \right) \leq \delta^{\frac{k}{t}} \quad \forall k>0 $$ (this answer is adapted from Lemma 6.2).
Proof: Define $X_i:=(Z_{i,a}-\mu_a)\mathbb{I}_{A_{i}=a}$. Define $\mathcal{F}_{<i}$ to be the $\sigma$-algebra generated by all actions and rewards observed before time $i$. We claim that $\forall i,\ E[X_i|\mathcal{F}_{<i}]=0$. To see this, notice that UCB's choice of the next arm is a deterministic function of the past actions and rewards, so $A_i$ is $\mathcal{F}_{<i}$-measurable; on the event $\{A_i=a\}$ the indicator equals $1$ and $E[Z_{i,a}-\mu_a|\mathcal{F}_{<i}]=E[Z_{i,a}-\mu_a]=0$ ($Z_{i,a}$ is independent of the past), and otherwise the indicator equals $0$.
Hence, $\{X_i\}$ is a martingale difference sequence and, by Azuma-Hoeffding (general form), $$Pr\left( \sum_{i=1}^t X_i \geq \epsilon:=\sqrt{\frac{\log(1/\delta)k}{2}} \right)\leq \exp\left( - \frac{2\epsilon^2}{t} \right) = \delta^{\frac{k}{t}},$$ where we used that each difference $X_i$ lies in an interval of length at most $1$ (rewards in $[0,1]$), so the sum of squared interval lengths in Azuma-Hoeffding is at most $t$. Then observe that, on $\{N_{t,a}\geq 1\}$, dividing by $N_{t,a}$ gives $$\left\{ \sum_{i=1}^t X_i \geq \sqrt{\frac{\log(1/\delta)k}{2}} \right\} = \left\{ \frac{1}{N_{t,a}}\sum_{s=1}^t Z_{s,a} I_{A_s=a} - \mu_a \geq \sqrt{\frac{\log(1/\delta)k}{2N_{t,a}^2}} \right\},$$ which completes the proof.
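As a sanity check, the bound can be probed by simulation. The sketch below (a two-armed Bernoulli setup of my own choosing, not from the question) runs UCB repeatedly, tracks $S_t=\sum_{i\leq t} X_i$ for arm $0$, and compares the empirical tail frequency against $\exp(-2\epsilon^2/t)$:

```python
import math
import random

def tail_check(t=20, eps=2.0, runs=20_000, seed=0):
    """Compare the empirical tail Pr(sum_i X_i >= eps) with the
    Azuma-Hoeffding bound exp(-2 eps^2 / t), where
    X_i = (Z_{i,0} - mu_0) * I{A_i = 0} under two-armed UCB.
    Bernoulli(0.6)/Bernoulli(0.4) arms are an illustrative choice."""
    rng = random.Random(seed)
    mus = [0.6, 0.4]
    exceed = 0
    for _ in range(runs):
        counts, sums, s = [0, 0], [0.0, 0.0], 0.0
        for i in range(1, t + 1):
            if i <= 2:
                a = i - 1          # initialization: pull each arm once
            else:
                a = max((0, 1), key=lambda j: sums[j] / counts[j]
                        + math.sqrt(math.log(i - 1) / (2 * counts[j])))
            z = float(rng.random() < mus[a])
            counts[a] += 1
            sums[a] += z
            if a == 0:
                s += z - mus[0]    # martingale increment X_i for arm 0
        if s >= eps:
            exceed += 1
    return exceed / runs, math.exp(-2 * eps * eps / t)
```

Since the inequality is a theorem, the empirical frequency should sit well below the returned bound.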
PS: The proof of Lemma 6.2 claims to use Azuma-Hoeffding, but at one key step it bounds $\sum_{i=1}^t X_i^2$ by $N_{t,a}$ and then plugs $N_{t,a}$ (which is random) into Azuma-Hoeffding, which is unjustified.