How to get a logarithmic dependence on $K$ for a follow the leader algorithm in i.i.d. stochastic $K$-armed bandits?


I want to prove that, using a follow the leader algorithm, the problem of identifying the best arm in a stochastic $K$-armed bandit problem with full feedback, where what we pay is the gap of the action we choose, exhibits a worst-case $\sqrt{\log(K)}$ dependence on $K$. Let me formalize the problem for those who are not acquainted with bandit terminology.

Let $K \in \mathbb{N}$ with $K\ge2$ and $1 > p_1 >p_2\ge\dots\ge p_K>0$.

Suppose that $(X_{i,t})_{i \in \{1,\dots,K\}, t \in \mathbb{N}}$ is an independent family of random variables such that, for each $i \in\{1,\dots,K\}$, the sequence $(X_{i,t})_{t \in \mathbb{N}}$ is an i.i.d. sequence of $[0,1]$-valued random variables, where each element in the sequence has expectation $p_i$.

Let $T \in \mathbb{N}$ be a time horizon.

Let $I_T$ be a $\{1,\dots,K\}$-valued random variable that maximizes the empirical means of the $X$s up to time $T$, i.e., $$I_T \in \operatorname{argmax}_{i \in \{1,\dots,K\}} \frac{1}{T} \sum_{t=1}^T X_{i,t}\;.$$
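For concreteness, here is a minimal simulation sketch of the full-feedback follow the leader choice $I_T$ (not part of the question; the Bernoulli rewards, function name, and parameter values are illustrative assumptions of mine):

```python
import numpy as np

def follow_the_leader(p, T, rng):
    """Return I_T (0-indexed): the arm with the largest empirical mean
    after observing T full-feedback rounds of Bernoulli(p_i) rewards."""
    p = np.asarray(p)
    # X[i, t] ~ Bernoulli(p[i]), independent across arms and rounds
    X = rng.random((len(p), T)) < p[:, None]
    empirical_means = X.mean(axis=1)
    return int(np.argmax(empirical_means))

rng = np.random.default_rng(0)
p = [0.9, 0.5, 0.5, 0.5]          # arm 0 is the best arm (p_1 in the post)
i_T = follow_the_leader(p, T=200, rng=rng)
```

With a gap of $0.4$ and $T = 200$ full-feedback rounds, the leader is the best arm with overwhelming probability.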

I'm wondering how large the expected gap $$\mathbb{E} [p_1 -p_{I_T}]$$ can be in terms of $T$ and $K$.

Specifically, I believe there should exist a universal constant $c>0$ (independent of $T$, $K$, $p_1,\dots,p_K$) such that $$\mathbb{E} [p_1 -p_{I_T}] \le c \cdot \sqrt{\frac{\log(K)}{T}} \;,$$ but, for now, a proof of this fact eludes me.

So far, I have worked in the following manner: \begin{align*} \mathbb{E} [p_1 -p_{I_T}] &= \sum_{j =2}^K (p_1 -p_{j})\mathbb{P} [I_T = j] \le \sum_{j =2}^K (p_1 -p_{j})\mathbb{P} \bigg[\max_{k \in \{1,\dots,K\}} \frac{1}{T} \sum_{t=1}^T X_{k,t} = \frac{1}{T}\sum_{t=1}^T X_{j,t} \bigg] \\ &\le \sum_{j =2}^K (p_1 -p_{j})\mathbb{P} \bigg[ \frac{1}{T} \sum_{t=1}^T X_{1,t} \le \frac{1}{T}\sum_{t=1}^T X_{j,t} \bigg] \\ &= \sum_{j =2}^K (p_1 -p_{j})\mathbb{P} \bigg[ \frac{1}{T} \sum_{t=1}^T \big((X_{1,t}-p_1)+(p_j-X_{j,t})\big) \le p_j - p_1 \bigg] \\ & \le \sum_{j =2}^K (p_1 -p_{j}) \cdot \exp\bigg(- \frac{(p_1-p_j)^2}{32}\cdot T\bigg) \;, \end{align*} where the last inequality follows from Hoeffding's inequality (each summand $(X_{1,t}-p_1)+(p_j-X_{j,t})$ is zero-mean and lies in an interval of length $2$). Now, this last term is maximized by picking $p_2 = \dots = p_K = p_1 - \frac{4}{\sqrt{T}}$, which yields the worst-case bound $$\mathbb{E} [p_1 -p_{I_T}] \le \frac{4}{\sqrt{e}} \cdot \frac{K-1}{\sqrt{T}},$$ which has the right dependence on $T$ but is far too loose in $K$: linear, rather than the desired $\sqrt{\log(K)}$.
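As a sanity check on this bound (a Monte Carlo sketch; the concrete choice $p_1 = 0.9$, the function name, and the run counts are assumptions of mine), one can estimate $\mathbb{E}[p_1 - p_{I_T}]$ in the worst-case configuration $p_2 = \dots = p_K = p_1 - 4/\sqrt{T}$ and compare it against $(4/\sqrt{e})\,(K-1)/\sqrt{T}$:

```python
import numpy as np

def expected_gap(p, T, n_runs, rng):
    """Monte Carlo estimate of E[p_1 - p_{I_T}] for follow the leader
    with full feedback and Bernoulli(p_i) rewards."""
    p = np.asarray(p)
    gaps = np.empty(n_runs)
    for r in range(n_runs):
        X = rng.random((len(p), T)) < p[:, None]   # one full run of rewards
        i_T = np.argmax(X.mean(axis=1))            # follow-the-leader pick
        gaps[r] = p[0] - p[i_T]                    # gap of the chosen arm
    return gaps.mean()

rng = np.random.default_rng(1)
K, T = 8, 100
p1 = 0.9                                           # illustrative choice
p = [p1] + [p1 - 4 / np.sqrt(T)] * (K - 1)         # worst case from above
gap = expected_gap(p, T, n_runs=2000, rng=rng)
bound = 4 / np.sqrt(np.e) * (K - 1) / np.sqrt(T)
```

In this regime the simulated gap sits far below the $(K-1)/\sqrt{T}$ bound, which is consistent with the suspicion that the linear dependence on $K$ is an artifact of the union-bound-style argument.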

Does anyone have an idea of how to get the improved dependence on $K$?

Accepted answer:

Here is an approach relying on the symmetrization trick (I'm still wondering whether a simpler approach would work).

Let $(X_{i,t}')_{i \in \{1,\dots,K\},t\in\mathbb{N}}$ be another process with the same distribution as $(X_{i,t})_{i \in \{1,\dots,K\},t\in\mathbb{N}}$, and independent of it. Then: \begin{align*} \mathbb{E} [p_1 -p_{I_T}] &= \mathbb{E} \bigg[\frac{1}{T}\sum_{t=1}^{T} X_{1,t} - \frac{1}{T}\sum_{t=1}^{T} X_{I_T,t} +\frac{1}{T} \sum_{t=1}^{T} X_{I_T,t} -p_{I_T}\bigg] \\ &\le \mathbb{E} \bigg[ \frac{1}{T}\sum_{t=1}^{T}X_{I_T,t} -p_{I_T}\bigg] \\ &= \mathbb{E} \bigg[\sum_{i \in \{1,\dots,K\}}\bigg(\frac{1}{T}\sum_{t=1}^{T} X_{i,t} - \frac{1}{T}\sum_{t=1}^{T} X_{i,t}'\bigg)\mathbb{I}\{I_T=i\}\bigg] \\ &\le \mathbb{E}\bigg[\sup_{i \in \{1,\dots,K\}}\frac{1}{T}\sum_{t=1}^{T} (X_{i,t}-X_{i,t}')\bigg] =: \mathbb{E}\bigg[\sup_{i \in \{1,\dots,K\}}Y_i\bigg]\;, \end{align*} where the first inequality holds because $I_T$ maximizes the empirical means (so $\frac{1}{T}\sum_{t=1}^{T} X_{1,t} \le \frac{1}{T}\sum_{t=1}^{T} X_{I_T,t}$), and the subsequent equality holds because, conditionally on $(X_{i,t})$ (and hence on $I_T$), the empirical mean of the independent copy satisfies $\mathbb{E}\big[\frac{1}{T}\sum_{t=1}^{T} X_{i,t}'\big] = p_i$. Now, each $Y_i$ is a zero-mean $\frac{\sigma^2}{T}$-subgaussian random variable with $\sigma = 1$ (for each $i \in \{1,\dots,K\}$ and each $t \in \mathbb{N}$, the difference $X_{i,t}-X'_{i,t}$ is zero-mean and $[-1,1]$-valued, hence $1$-subgaussian by Hoeffding's lemma), so, by Jensen's inequality, for each $\lambda > 0$, \begin{align*} \exp\bigg(\lambda \mathbb{E}\bigg[\sup_{i \in \{1,\dots,K\}}Y_i\bigg]\bigg) \le \mathbb{E}\bigg[\sup_{i \in \{1,\dots,K\}} \exp(\lambda \cdot Y_i)\bigg] &\le \sum_{i=1}^K \mathbb{E} \big[ \exp(\lambda \cdot Y_i) \big] \le K \cdot \exp\bigg(\frac{\lambda^2 \cdot \sigma^2}{2\cdot T}\bigg) \;. \end{align*} Taking logarithms and rearranging, we get, for each $\lambda > 0$, $$ \mathbb{E} [p_1 -p_{I_T}] \le \frac{\log(K)}{\lambda} + \lambda \cdot \frac{\sigma^2}{2\cdot T}\;, $$ and choosing $\lambda = \sqrt{\frac{2 \cdot \log(K)\cdot T}{\sigma^2}}$ yields $$ \mathbb{E} [p_1 -p_{I_T}] \le \sigma \cdot \sqrt{2} \cdot \sqrt{\frac{\log(K)}{T}} = \sqrt{\frac{2 \cdot\log(K)}{T}} \;. $$
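To illustrate the final bound numerically (a Monte Carlo sketch with hypothetical parameter choices, not a proof), one can check by simulation that the expected gap stays below $\sqrt{2\log(K)/T}$ even when all the gaps are of order $1/\sqrt{T}$, the hardest regime for follow the leader:

```python
import numpy as np

def expected_gap(p, T, n_runs, rng):
    """Monte Carlo estimate of E[p_1 - p_{I_T}] for follow the leader
    with full feedback and Bernoulli(p_i) rewards."""
    p = np.asarray(p)
    gaps = np.empty(n_runs)
    for r in range(n_runs):
        X = rng.random((len(p), T)) < p[:, None]
        gaps[r] = p[0] - p[np.argmax(X.mean(axis=1))]
    return gaps.mean()

rng = np.random.default_rng(2)
K, T = 16, 50
# all gaps equal to 1/sqrt(T); p_1 = 0.6 is an illustrative choice
p = [0.6] + [0.6 - 1 / np.sqrt(T)] * (K - 1)
gap = expected_gap(p, T, n_runs=2000, rng=rng)
bound = np.sqrt(2 * np.log(K) / T)
```

Here the simulated gap is well below the $\sqrt{2\log(K)/T}$ bound, and rerunning with larger $K$ shows the slow, logarithmic-type growth in $K$ rather than the linear growth of the union-bound argument.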