Uniqueness of optimal stopping time for a one-armed bandit to reach break-even?

40 Views Asked by Bumbble Comm At 29 Mar 2026 - 4:26

This question is based on the derivation of the Gittins index in Weber's article "On the Gittins index for multiarmed bandits".

Consider a one-armed bandit consisting of a Markov process $X$ and a nonnegative, bounded reward function $R$. The bandit starts in state $X(0) = x$ at time $t = 0$. At the $t$-th play a playing charge $c$ is paid, the bandit moves to a new state $X(t)$, and a reward $R(X(t))$ is received. The bandit is stopped after $\tau$ plays, where $\tau$ is a stopping time, which results in an expected discounted gain \begin{equation*} V_c^\tau(x) = \mathbb{E}\bigg[\sum^{\tau}_{s=1}\beta^s(R(X(s)) - c)\bigg|X(0) = x\bigg], \end{equation*} where $0 < \beta < 1$ is a discount factor. The goal is to find a stopping time $\tau$ that is optimal, i.e., satisfies \begin{equation*} V_c^\tau(x) = V_c^*(x) := \sup_{\sigma}V_c^\sigma(x), \end{equation*} where the supremum is over all stopping times $\sigma$.

For sufficiently low playing charges $c$ the optimal expected discounted gain $V_c^*(x)$ is positive, meaning it is profitable to play ($\tau\neq0$), while for sufficiently high charges $V_c^*(x)$ is zero because it is best not to play ($\tau=0$). This motivates the definition of a "fair charge" \begin{equation*} \gamma(x) = \inf\{c\in\mathbb{R}:V_c^*(x)=0\}. \end{equation*} Suppose that the playing charge $c$ equals the fair charge $\gamma(x)$. Then the optimal expected discounted gain $V_c^*(x)$ equals zero and not playing ($\tau=0$) is optimal.
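To make the definition of $\gamma(x)$ concrete, here is a minimal numerical sketch. It assumes a finite state space, which the setup above does not require, and the names `P` (transition matrix), `R` (reward per state), `optimal_gain`, `continuation_value` and `fair_charge` are made up for illustration. It computes $V_c^*$ by value iteration on the Bellman equation $V(x) = \max\{0, \beta\,\mathbb{E}[R(X(1)) - c + V(X(1)) \mid X(0)=x]\}$ and then locates $\gamma(x)$ by bisection on $c$.

```python
def continuation_value(P, R, beta, c, V):
    """Expected gain of playing once from each state, then collecting V afterwards."""
    n = len(R)
    return [beta * sum(P[x][y] * (R[y] - c + V[y]) for y in range(n))
            for x in range(n)]


def optimal_gain(P, R, beta, c, tol=1e-12, max_iter=100_000):
    """Value iteration for V_c^* on a finite state space (an assumption).

    Iterates V(x) <- max(0, continuation value at x), starting from V = 0.
    """
    V = [0.0] * len(R)
    for _ in range(max_iter):
        Q = continuation_value(P, R, beta, c, V)
        V_new = [max(0.0, q) for q in Q]  # stopping is always worth 0
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


def fair_charge(P, R, beta, x, width=1e-9):
    """Bisection for gamma(x) = inf{c : V_c^*(x) = 0}.

    gamma(x) lies in [min R, max R]: below min R one play is profitable,
    and at max R no play can ever be profitable.
    """
    lo, hi = min(R), max(R)
    while hi - lo > width:
        mid = 0.5 * (lo + hi)
        if optimal_gain(P, R, beta, mid)[x] > 0.0:
            lo = mid  # still profitable to play, so gamma(x) > mid
        else:
            hi = mid  # not profitable, so gamma(x) <= mid
    return hi


# Hypothetical example: deterministic chain 0 -> 1 -> 2, with state 2 absorbing.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
R = [0.0, 0.2, 1.0]
beta = 0.9

gamma0 = fair_charge(P, R, beta, x=0)
V_at_gamma = optimal_gain(P, R, beta, gamma0)
q0 = continuation_value(P, R, beta, gamma0, V_at_gamma)[0]
print(gamma0, q0)
```

For this particular chain the fair charge can be checked by hand: $\gamma(0) = 0.2(1-\beta) + \beta = 0.92$, and the sketch should agree up to the bisection width. The printed continuation value `q0` should be numerically zero, so in this example playing once and then continuing optimally is worth the same as not playing at all when $c = \gamma(0)$. This is only a numerical illustration for one toy chain, not a proof.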

Question: Is not playing the unique optimal stopping time, or does there also exist an optimal stopping time $\tau$ that is almost surely non-zero?

This question is motivated by a remark on the second page of Weber's article, which says that in this situation where $c = \gamma(x)$ "the gambler may continue to play the bandit as a fair game for further epochs". This suggests that, although not playing is optimal, it is not the only optimal solution and there are other optimal solutions that do involve playing.

Is this so, and how do I show this?

I managed to derive some properties of the optimal expected discounted gain $V_c^*(x)$ as a function of the charge $c$: it is nonincreasing and Lipschitz continuous in $c$, and $V_c^*(x)=0$ if and only if $c\geq\gamma(x)$. The trivial optimal solution of not playing followed from simple arguments. But what about the existence of non-trivial solutions? Any suggestions are appreciated!

Edit: Some further ideas.

If the playing charge $c$ equals the fair charge $\gamma(x)$, then $V_c^*(x)=0$ and \begin{equation*} 0 = V_c^*(x) = \sup_{\tau > 0} V_c^{\tau}(x) = \sup_{\tau > 0}\bigg( \mathbb{E}\bigg[\sum^{\tau}_{s=1}\beta^s R(X(s))\bigg|X(0)=x\bigg] -c \mathbb{E}\bigg[\sum^{\tau}_{s=1}\beta^s \bigg|X(0)=x\bigg] \bigg), \end{equation*} hence \begin{equation*} c = \sup_{\tau > 0} \frac{ \mathbb{E}\big[\sum^{\tau}_{s=1}\beta^s R(X(s))|X(0)=x\big] }{ \mathbb{E}\big[\sum^{\tau}_{s=1}\beta^s |X(0)=x\big] }. \end{equation*} By choosing the stopping time $\tau$ appropriately, I can bring the ratio on the right-hand side of the last equation arbitrarily close to the fair charge $c$, and thus get arbitrarily close to optimality. All stopping times $\tau$ chosen in this way satisfy $\tau > 0$. Could this imply the existence of an optimal stopping time $\tau$ satisfying $\tau > 0$?
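To spell out the "hence" step, here is a sketch. The shorthand $A(\tau)$ and $B(\tau)$ is mine and not from Weber's article, and I read $\tau > 0$ as $\tau \geq 1$ almost surely. Write \begin{equation*} A(\tau) = \mathbb{E}\bigg[\sum^{\tau}_{s=1}\beta^s R(X(s))\bigg|X(0)=x\bigg], \qquad B(\tau) = \mathbb{E}\bigg[\sum^{\tau}_{s=1}\beta^s \bigg|X(0)=x\bigg], \end{equation*} so that $\beta \leq B(\tau) \leq \beta/(1-\beta)$ for every such $\tau$. Since $\sup_{\tau>0}\big(A(\tau) - cB(\tau)\big) = 0$, every $\tau > 0$ satisfies $A(\tau) \leq cB(\tau)$, i.e. $A(\tau)/B(\tau) \leq c$. Conversely, for each $\varepsilon > 0$ there is a $\tau_\varepsilon > 0$ with $A(\tau_\varepsilon) - cB(\tau_\varepsilon) > -\varepsilon$, and dividing by $B(\tau_\varepsilon) \geq \beta$ gives \begin{equation*} c - \frac{\varepsilon}{\beta} \leq c - \frac{\varepsilon}{B(\tau_\varepsilon)} < \frac{A(\tau_\varepsilon)}{B(\tau_\varepsilon)} \leq c. \end{equation*} Letting $\varepsilon \to 0$ shows that the supremum of the ratio equals $c$. Note that this argument only shows that the supremum is approached; it does not by itself show that it is attained by some $\tau > 0$.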
