I'm following this note to learn how to derive a regret upper bound for the UCB algorithm on the Stochastic Multi-Armed Bandit Problem. In particular, the proof of Lemma 15.6 there asserts that we can apply Hoeffding's inequality to $$ Pr\left(\frac{1}{N_{t,a}}\sum_{s=1}^t Z_{s,a} I_{A_s=a} - \mu_a \geq \epsilon \right) $$ where the $Z_{s,a}$ are iid with mean $\mu_a$ and are independent of the other arms' rewards, $A_s$ is the action taken at time $s$, which is in general a function of the past rewards and past actions, and $N_{t,a}=\sum_{s=1}^t I_{A_s=a}$.
Of course, a natural starting point is to condition on $N_{t,a}=n$, so that inside the probability there are $n$ iid random variables. However, I cannot proceed because the $Z$'s are clearly not independent of $N_{t,a}$: $N_{t,a}$ depends on $(A_s)_{s=1,\dots,t}$, and each $A_s$ depends on $(Z_{i,a})_{i=1,\dots,s-1,\, a\in\mathcal{A}}$; in particular, UCB specifies $A_{s+1}= \operatorname{argmax}_a U_{s,a}$, where $U_{s,a}:= \frac{1}{N_{s,a}}\sum_{i=1}^s Z_{i,a} I_{A_i=a} + \sqrt{\frac{\log(s)}{2N_{s,a}}}$.
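For concreteness, here is a minimal Python sketch of the UCB rule above. The reward samplers and horizon below are my own illustrative choices, not anything from the notes:

```python
import math
import random

def ucb(arms, horizon, seed=0):
    """Minimal UCB sketch. `arms` is a list of reward samplers:
    callables taking an rng and returning a reward in [0, 1]."""
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k      # N_{t,a}: number of pulls of arm a so far
    sums = [0.0] * k      # running sum of rewards observed from arm a
    history = []
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1     # initialization: pull each arm once
        else:
            # A_t = argmax_a  empirical mean  +  sqrt(log(t-1) / (2 N_{t-1,a}))
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(math.log(t - 1) / (2 * counts[i])))
        r = arms[a](rng)
        counts[a] += 1
        sums[a] += r
        history.append((a, r))
    return counts, sums, history
```

For example, `ucb([lambda rng: float(rng.random() < 0.9), lambda rng: float(rng.random() < 0.5)], 1000)` concentrates almost all pulls on the first arm.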
Similarly, the online notes here (page 2, Lemma 1) and here (page 8, Theorem 4) both directly claim that Hoeffding's inequality applies to the probability above, without justification.
I'm aware that there are other methods to derive an upper bound for UCB, but I'd really like to confirm if this approach is a dead end.
I believe I have a simple counterexample to the approach of conditioning on $N$. Consider the UCB algorithm for $k=2$ (an action space with two elements). The algorithm begins by pulling each arm once: at $t=1$ we choose $A_1=1$ and sample $Z_{t=1,a=1}$; at $t=2$ we choose $A_2=2$ and sample $Z_{t=2,a=2}$.
Then $t=3$ gives the counterexample: we set $A_3=\operatorname{argmax}_a \{U_{t=2,a=1}, U_{t=2,a=2}\}$, and since both arms have been pulled exactly once, the exploration bonuses are equal, so this amounts to choosing $A_3=1$ if $Z_{1,1}>Z_{2,2}$ and $A_3=2$ otherwise; we then sample $Z_{3,A_3}$. Consider the following:
$$ P( \frac{1}{2} (Z_{t=1,a=1} + Z_{t=3,a=1}) - \mu_1 \geq \epsilon | N_{3,1}=2) = P( \frac{1}{2} (Z_{1,1} + Z_{3,1}) - \mu_1 \geq \epsilon | Z_{1,1}>Z_{2,2}) $$
This shows that the $Z$'s and $N_{t,a}$ cannot be independent in general; moreover, conditioning on $N_{t,a}$ can alter the distribution of the $Z$'s, e.g. the conditioned $Z$'s need no longer have mean $\mu_a$.
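This bias is easy to verify numerically. A small Monte Carlo sketch, assuming Uniform$[0,1]$ rewards (my own illustrative choice, so $\mu_1=1/2$), estimates $E[(Z_{1,1}+Z_{3,1})/2 \mid Z_{1,1}>Z_{2,2}]$; since $E[Z_{1,1}\mid Z_{1,1}>Z_{2,2}]=2/3$ while $Z_{3,1}$ is unaffected, the conditional mean should come out near $7/12$, not $1/2$:

```python
import random

def conditional_mean(trials=200_000, seed=0):
    """Estimate E[(Z_{1,1} + Z_{3,1}) / 2 | Z_{1,1} > Z_{2,2}] for
    Uniform[0,1] rewards (illustrative choice; mu_1 = 0.5)."""
    rng = random.Random(seed)
    total, hits = 0.0, 0
    for _ in range(trials):
        z11, z22 = rng.random(), rng.random()
        if z11 > z22:              # the event {N_{3,1} = 2} under UCB
            z31 = rng.random()     # second pull of arm 1, at t = 3
            total += (z11 + z31) / 2
            hits += 1
    return total / hits            # ≈ 7/12 ≈ 0.583, not mu_1 = 0.5
```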
Hence, I believe this is an error in the notes. To still obtain an overall bound on the regret, we need a different approach altogether, e.g.
https://tor-lattimore.com/downloads/book/book.pdf (page 104, theorem 7.1)
https://dash.harvard.edu/bitstream/handle/1/37364663/ZHAO-SENIORTHESIS-2020.pdf?sequence=1&isAllowed=y (page 22, theorem 2.7)
Thanks to @JohnPrice's pointer, it is possible to show that $$ Pr\left(\frac{1}{N_{t,a}}\sum_{s=1}^t Z_{s,a} I_{A_s=a} - \mu_a \geq \sqrt{\frac{\log(1/\delta)k}{2N_{t,a}^2}} \right) \leq \delta^{\frac{k}{t}} \quad \forall k>0 $$ (this answer is adapted from Lemma 6.2).
Proof: Define $X_i:=(Z_{i,a}-\mu_a)\mathbb{I}_{A_{i}=a}$. Define $\mathcal{F}_{<i}$ to be the $\sigma$-algebra generated by all actions and rewards observed before time $i$. We claim that $\forall i,\ E[X_i|\mathcal{F}_{<i}]=0$. To see this, notice that UCB's choice of the next arm is a deterministic function of the past actions and rewards, so $A_i$ is $\mathcal{F}_{<i}$-measurable; on the event $\{A_i=a\}$ the indicator equals $1$ and $E[Z_{i,a}-\mu_a|\mathcal{F}_{<i}]=E[Z_{i,a}-\mu_a]=0$ ($Z_{i,a}$ is independent of the past), and otherwise the indicator equals $0$.
Hence, $\{X_i\}$ is a martingale difference sequence and, by Azuma-Hoeffding (general form), $$Pr\left( \sum_{i=1}^t X_i \geq \epsilon:=\sqrt{\frac{\log(1/\delta)k}{2}} \right)\leq \exp\left( - \frac{2\epsilon^2}{t} \right) = \delta^{\frac{k}{t}},$$ where we used that each difference $X_i$ lies in an interval of length at most $1$ (rewards in $[0,1]$), so the sum of squared interval lengths in Azuma-Hoeffding is at most $t$. Then observe that, on $\{N_{t,a}\geq 1\}$, dividing by $N_{t,a}$ gives $$\left\{ \sum_{i=1}^t X_i \geq \sqrt{\frac{\log(1/\delta)k}{2}} \right\} = \left\{ \frac{1}{N_{t,a}}\sum_{s=1}^t Z_{s,a} I_{A_s=a} - \mu_a \geq \sqrt{\frac{\log(1/\delta)k}{2N_{t,a}^2}} \right\},$$ which completes the proof.
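As a sanity check, the bound can be probed by simulation. The sketch below (a two-armed Bernoulli setup of my own choosing, not from the question) runs UCB repeatedly, tracks $S_t=\sum_{i\leq t} X_i$ for arm $0$, and compares the empirical tail frequency against $\exp(-2\epsilon^2/t)$:

```python
import math
import random

def tail_check(t=20, eps=2.0, runs=20_000, seed=0):
    """Compare the empirical tail Pr(sum_i X_i >= eps) with the
    Azuma-Hoeffding bound exp(-2 eps^2 / t), where
    X_i = (Z_{i,0} - mu_0) * I{A_i = 0} under two-armed UCB.
    Bernoulli(0.6)/Bernoulli(0.4) arms are an illustrative choice."""
    rng = random.Random(seed)
    mus = [0.6, 0.4]
    exceed = 0
    for _ in range(runs):
        counts, sums, s = [0, 0], [0.0, 0.0], 0.0
        for i in range(1, t + 1):
            if i <= 2:
                a = i - 1          # initialization: pull each arm once
            else:
                a = max((0, 1), key=lambda j: sums[j] / counts[j]
                        + math.sqrt(math.log(i - 1) / (2 * counts[j])))
            z = float(rng.random() < mus[a])
            counts[a] += 1
            sums[a] += z
            if a == 0:
                s += z - mus[0]    # martingale increment X_i for arm 0
        if s >= eps:
            exceed += 1
    return exceed / runs, math.exp(-2 * eps * eps / t)
```

Since the inequality is a theorem, the empirical frequency should sit well below the returned bound.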
PS: The proof of Lemma 6.2 claims to use Azuma-Hoeffding, but at one key step it bounds $\sum_{i=1}^t X_i^2$ by $N_{t,a}$ and then plugs $N_{t,a}$ (which is random) into Azuma-Hoeffding, which is unjustified.