Here is a question asked on MSE a few years ago: Optimal stopping in red vs black card game deck of 52 cards
Imagine you are playing a card game you start with a well shuffled deck of $52$ cards, containing $26$ red cards and $26$ black cards, stacked face down. You have a sequence of turns, $52$ possible turns in total, each turn you either pull the top card and turn it over, or you quit. If you pull a red card you win $1$, and if you pull a black card you lose $1$. If you played all $52$ turns without stopping you are guaranteed to break even. What is the optimal stopping strategy?
There is an answer given there by quasi: https://math.stackexchange.com/a/2544082/834963
Let $v(r,b)$ be the expected value of the game for the player, assuming optimal play, if the remaining deck has $r$ red cards and $b$ black cards.
Then $v(r,b)$ satisfies the recursion $$ v(r,b) = \begin{cases} 0&\;\text{if}\;r=0\\[4pt] r&\;\text{if}\;b=0\;\text{and}\;r>0\\[4pt] \max(0,f(r,b))&\;\text{if}\;r,b>0\\[4pt] \end{cases} $$ where $$ f(r,b) = \left({\small{\frac{r}{r+b}}}\right)(1+v(r-1,b)) + \left({\small{\frac{b}{r+b}}}\right)(-1+v(r,b-1)) $$ The stopping rule is simple: Stop when $v(r,b)=0$.
To explain the recursion . . .
If $r,b>0$, and the player elects to play a card, then:
- The revealed card is red with probability ${\large{\frac{r}{r+b}}}$, and in that case, the player gets a score of $+1$, and the new value is $v(r-1,b)$.$\\[4pt]$
- The revealed card is black with probability ${\large{\frac{b}{r+b}}}$, and in that case, the player gets a score of $-1$, and the new value is $v(r,b-1)$.
But the player always has the option to quit, hence, if $r,b>0$, we get $v(r,b) = \max(0,f(r,b))$.
Implementing the recursion in Maple, the value of the game is $$v(26,26) = \frac{41984711742427}{15997372030584}\approx 2.624475549$$ and the optimal stopping strategy is as follows . . .
- If $24 \le b \le 26$, play while $r \ge b - 5$.$\\[4pt]$
- If $17 \le b \le 23$, play while $r \ge b - 4$.$\\[4pt]$
- If $11 \le b \le 16$, play while $r \ge b - 3$.$\\[4pt]$
- If $6 \le b \le 10$, play while $r \ge b - 2$.$\\[4pt]$
- If $3 \le b \le 5$, play while $r \ge b - 1$.$\\[4pt]$
- If $1 \le b \le 2$, play while $r \ge b$.$\\[4pt]$
- If $b = 0$, play while $r > 0$.
However, given that this question was asked as part of a job interview, I don't think it's very realistic to get $2.62$ in a short amount of time. I was wondering if there a quick way to:
- Find the optimal stopping rule in terms of maximizing expected payoff.
- Get an estimate close to the actual answer of around $2.62$ without laborious recursive calculation or writing a program.
Look for the 'pivot points' and make continuous approximations
Here's part of an answer...
If we do some handwaving and approximations we can get something more compact without actually doing the computations.
First, simplifying the notation of $v$ and $f$, we can combine into $q$ notation: q depends on both state and action, and is the expected value of taking the proposed action in the state and then acting optimally thereafter.
Using Bellman's optimal q equation, writing $q(r, b)$ for the quality of the action $\operatorname{play}$ in state $(r, b)$ (the quality of the action $\operatorname{stop}$ is always $0$)
$$q(r, b) = \frac {r} {r+b}\left(1 + \max(0, q(r-1, b))\right) + \frac {b} {r+b}\left(-1 + \max(0, q(r, b-1))\right)$$
The point of the question is to determine when $q(r, b) > 0$ (i.e. we should take the $\operatorname{play}$ action) and also what is $q(26, 26)$. What follows makes some progress toward the former.
So the criterion is
$$q(r, b) > 0$$
First handwave: in the region of $0$, less $r$ is 'bad', less $b$ is 'good'
Those $\max$ expressions sure make this fiddly!
Let's look for 'pivot points' where $q \simeq 0$.
Assuming $q$ is roughly continuous, when we're near (in $(r, b)$-space) $q(r, b) = 0$ we should expect that reducing $r$ pushes us into negative territory. And we should assume that reducing $b$ pushes us into positive territory.
So 'near a pivot' we can say
$$\max(0, q(r-1, b)) = 0$$ $$\max(0, q(r, b-1)) = q(r, b-1)$$
and also
$$\frac {\partial q} {\partial r}(r, b) > 0$$ $$\frac {\partial q} {\partial b}(r, b) < 0$$
Second handwave: small changes in $r$, and $b$ have roughly linear effects
This is basically a Taylor assumption. I don't justify it here, but it seems intuitively sensible for the domain in question.
Now we can say things like, near a pivot where $q(r, b) = 0$
$$\max(0, q(r-1, b)) = \max\left(0, q(r, b) - \frac {\partial q} {\partial r}(r, b)\right) = 0$$ $$\max(0, q(r, b-1)) = \max\left(0, q(r, b) - \frac {\partial q} {\partial b}(r, b)\right) = -\frac {\partial q} {\partial b}(r, b)$$
Putting it together and 'solving'
Now we have, near a 'pivot point' where $q(r, b) \simeq 0$,
$$q(r, b) \simeq \frac r {r+b} + \frac b {r+b}\left(-1 - \frac {\partial q} {\partial b}(r, b)\right)$$
So we have a first order partial differential equation for $q$ in $b$
$$0 \simeq r + b \left(-1 - \frac {\partial q} {\partial b}\right)$$
Separating,
$$\frac {\partial q} {\partial b} = -1 + \frac r b$$
and solving
$$q = - b + r \ln b + c_r(r)$$
I'm not sure what handwaves are required, but intuitively we should be able to do some similar tricks to get
$$q = r + c_b(b)$$
and from there
$$q = r - b + r\ln b + c$$
which yields the criterion (for some $c$ to be determined, perhaps by inspecting small $r$, $b$ cases)
Presumably an appropriate $c$ is small, maybe zero, since for small $r$ and $b$, $r-b$ is a good criterion.
Finding $q(26, 26)$
The above solution should not be expected to extrapolate usefully 'away from pivots', since many of the assumptions require being in the region of a pivot. So we can't use this to answer the other part of the question.