Computing the expected payoff of a two-armed bandit

97 Views Asked by Bumbble Comm At 03 Apr 2026 - 9:42

Assume you have a two-armed bandit in which one arm has a fixed, known payoff probability $p = 0.6$ and the other has an unknown payoff probability $q$, which is drawn uniformly from $[0,1]$. In each game the player gets to pull an arm $N$ times, and $q$ is revealed to the player at the start of the game. The player will obviously choose the arm with the higher probability, so the rule is: choose the arm with payoff probability $\max(p,q)$. What is the expected payoff here (if one gets 1 unit of payoff per successful pull) at the start of each game?

Intuitively, 60% of the time $q \leq p$, and the player will choose the $p$ arm. In the remaining 40% of games, $q > p$, and the player will choose the $q$ arm, so the expected payoff per pull must be greater than 0.6.

I'm trying to calculate $E[\max(p,q)]$ formally. I tried this:

$E[\max(p,q)] = \int\max(p,q) \times q \times 1 dq$ (we assume payoff of $1$ which drops out)

Since $q \in [0,1]$ and $p$ is fixed and known in advance, we only need to integrate with respect to $q$ on $[0,1]$:

$$ E[\max(p,q)] = \int_{0}^{1}\max(p,q) \times q \, dq $$

yielding:

$$ E[\max(p,q)] = \int_{0}^{0.6} \max(p,q)qdq + \int_{0.6}^{1}\max(p,q)qdq = 0.6(q^2)\big|_{0}^{0.6} + (q^2)\big|_{0.6}^{1} $$

This looks wrong (the textbook says the answer is 0.68 and gives no explanation). Can you show the correct, full formal derivation using expectations, and also give some intuition for getting the answer without a formal calculation?


There is 1 best solution below

Bumbble Comm On 08 Jan 2015 - 4:39

It is best to clearly define the random variable of interest. Here, it is the payoff $X$, the sum of $N$ independent trials $I_1, I_2, \ldots, I_N$, where each $I_k$ is drawn from a Bernoulli distribution with probability

$$\Pr[I_k = 1] = \max(0.6,q), \quad k = 1, 2, \ldots, N.$$

Therefore, conditional on $q$,

$$X \sim \operatorname{Binomial}(N, \max(0.6,q)).$$

But $q \sim \operatorname{Uniform}(0,1)$ is itself a random variable, so the conditional expectation, given $q$, is

$$\operatorname{E}[X \mid q] = \begin{cases} Nq, & q > 0.6, \\ 0.6N, & q \le 0.6. \end{cases}$$

Hence the unconditional expectation of $X$ is given by the iterated (or double) expectation formula:

$$\begin{align*} \operatorname{E}[X] &= \operatorname{E}[\operatorname{E}[X \mid q]] \\ &= \operatorname{E}[Nq \mid q > 0.6]\Pr[q > 0.6] + (0.6 N)\Pr[q \le 0.6] \\ &= N\cdot \frac{0.6 + 1}{2} (1 - 0.6) + (0.6N)(0.6) \\ &= 0.68N. \end{align*}$$

The third line uses the fact that, given $q > 0.6$, $q$ is uniform on $(0.6, 1)$ and so has mean $\frac{0.6 + 1}{2} = 0.8$. This is also the intuition without any integration: 60% of the time the player settles for $0.6$ per pull, and 40% of the time the player gets $0.8$ per pull on average, so the expected payoff per pull is $0.6 \cdot 0.6 + 0.4 \cdot 0.8 = 0.68$, matching the textbook.
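As a sanity check on the result above, here is a minimal Python sketch that is not part of the original answer. It assumes $q \sim \operatorname{Uniform}(0,1)$ with density $1$, so that $\operatorname{E}[\max(p,q)] = \int_0^1 \max(p,q)\,dq = p^2 + \frac{1 - p^2}{2}$ (the integrand is weighted by the density, not by $q$). The function names, the seed, and the sample sizes are hypothetical choices made for illustration only.

```python
import random


def expected_payoff_closed_form(p: float, n_pulls: int = 1) -> float:
    """Closed form N * (p^2 + (1 - p^2) / 2), assuming q ~ Uniform(0, 1)."""
    return n_pulls * (p * p + (1.0 - p * p) / 2.0)


def expected_payoff_midpoint(p: float, n_pulls: int = 1, steps: int = 100_000) -> float:
    """Approximate N * integral_0^1 max(p, q) dq with the midpoint rule."""
    h = 1.0 / steps
    total = sum(max(p, (i + 0.5) * h) for i in range(steps))
    return n_pulls * total * h


def simulate_games(p: float, n_pulls: int, n_games: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean payoff per game.

    Each game draws q uniformly, picks the arm with success probability
    max(p, q), and counts the successes over n_pulls Bernoulli pulls.
    """
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    total = 0
    for _ in range(n_games):
        q = rng.random()
        best = max(p, q)
        total += sum(rng.random() < best for _ in range(n_pulls))
    return total / n_games


if __name__ == "__main__":
    p, n_pulls = 0.6, 10
    print("closed form :", expected_payoff_closed_form(p, n_pulls))
    print("midpoint    :", expected_payoff_midpoint(p, n_pulls))
    print("simulation  :", simulate_games(p, n_pulls, n_games=20_000))
```

All three numbers should agree with $0.68N$ up to numerical and sampling error; the simulation is only approximate and depends on the number of games.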
