When is supremum of the expectation equal to the expectation of the supremum using control processes


This is a question I have from stochastic control. I know that in general $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right]\leqslant \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$. I would normally assume that the reverse inequality does not hold in general, just as with typical inequalities such as Jensen's, but in the proof below I see that a similar equality is proven, which makes it seem as if $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right] = \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$ holds. However, I don't fully understand/agree with part of the proof.


In the following, $\mathbb E\left[f(X_T)|\mathcal F_t;\pi\right]$ means that the stochastic process $X_t$ is controlled by the process $\pi$, not that $\pi$ is assumed known. The argument is from Markov Decision Processes and Dynamic Programming at the top of page 9.

In the solution for Bellman's principle (discrete time), I have seen that the following is done, for admissible control processes $\pi$: \begin{equation} \underset{\pi} \sup \mathbb E\left[V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] = \mathbb E\left[\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] \end{equation} which is proven by showing that both inequalities hold. So: \begin{equation} \underset{\pi} \sup \mathbb E\left[V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right]\leqslant \mathbb E\left[\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] \end{equation} which I can see follows from $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right]\leqslant \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$

Then the reverse inequality is proven, where $\pi^* = \underset{\pi}{\operatorname{argmax}}\, V^{\pi}(t+1,X_{t+1})$: \begin{equation} \mathbb E\left[\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] = \mathbb E\left[V^{\pi^*}(t+1,X_{t+1})|X_t=x; \pi^* \right]\leqslant \underset{\pi} \sup \mathbb E\left[V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] \end{equation} which follows from the definition of the supremum.


I am wondering why we can't apply this same procedure to $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right]\leqslant \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$ and show that equality holds there as well. The issue seems to lie in the reverse-inequality step, and I can also think of some simple examples that seem to defy the equality.

For example: There is just one time step, from $T-1$ to $T$, and the probability space has just $3$ points: $\{ \omega_1,\omega_2,\omega_3\}$, all with equal likelihood. There is a function $f(y,X_T(\omega))$, for $y \in \{0,1\}$, such that whenever $y = 0$, $f(0,X_T(\omega)) = 30$, and that when $y = 1$, $f(1,X_T(\omega_1)) = -3000$, $f(1,X_T(\omega_2)) = -3000$, and $f(1,X_T(\omega_3)) = 300$.

So if we wanted to calculate $\underset{y\in \{0,1\}} \sup \mathbb E\left[f(y,X_T(\omega))\right]$, we maximize the expectation over choices of $y$, so the choice would be $y = 0$, for a value of $30$, regardless of the state of the world. And for $ \mathbb E\left[ \underset{y\in \{0,1\}} \sup f(y,X_T(\omega))\right]$, we choose $y = 1$ when the state of the world is $\omega_3$ and $y = 0$ otherwise, and so since each possibility has equal weighting, the value of the expression is $\frac{300 + 30 + 30}{3} = 120$, which doesn't match the $30$ from $\underset{y\in \{0,1\}} \sup \mathbb E\big[f(y,X_T(\omega))\big]$.
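The two quantities in this example can be checked mechanically. Here is a short Python sketch of the computation (the dictionary layout is my own; all numbers come straight from the example above):

```python
# Numerical check of the three-point example: sup_y E[f] versus E[sup_y f].
f = {0: {"w1": 30, "w2": 30, "w3": 30},
     1: {"w1": -3000, "w2": -3000, "w3": 300}}
omegas = ("w1", "w2", "w3")  # equally likely outcomes

# sup_y E[f(y, X_T)]: commit to one y first, then average over omega.
sup_of_E = max(sum(f[y][w] for w in omegas) / 3 for y in (0, 1))

# E[sup_y f(y, X_T)]: for each realized omega, pick the best y, then average.
E_of_sup = sum(max(f[y][w] for y in (0, 1)) for w in omegas) / 3

print(sup_of_E)  # 30.0  (take y = 0)
print(E_of_sup)  # 120.0 (y = 1 on w3, y = 0 otherwise)
```

The gap between 30 and 120 is exactly the value of being allowed to choose $y$ after seeing $\omega$.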

So if we consider the control process to be the choice of $y$, then this poses a contradiction with the proof of the Bellman principle. One issue I see with my example is that the supremum on the outside of the expectation seems to not be able to 'look into the future' while the supremum on the inside of the expectation can. I'm not sure how to reconcile this.


So I think my main questions are:

  1. Can my simple (possibly incorrect) example be reconciled with how the proof for the Bellman principle was done?

  2. What exactly does $\underset{\pi} \sup V^{\pi}(X_{t+1})$ mean, and how can it be evaluated? Is it a random variable, where the maximum of $V^{\pi}(X_{t+1})$ is chosen across policies, depending on the state of the world?

Any clarification would be appreciated. I've been trying to get this answered for a while now. I'll award the bounty and correct answer even for just a link that'll help explain things! Thanks a lot!


There are 2 best solutions below


In my view there is no real confusion in your notation as long as you consistently use a semicolon to distinguish the control from genuine conditioning; for posterity, let's agree that $V^\pi$ is the value function under the policy $\pi$. Under mild conditions the interchange of the supremum and integral is permitted, and the equality does hold -- see here, Proposition 5.1. Of course, it need not hold in general.

You are probably wondering: if that's the case, why isn't the Bellman equation trivial? The reason is that it provides you with an explicit relationship between the process and the optimal control $\pi^\star$. Showing that equality can be achieved in the "$\sup\int \leq \int\sup$" inequality is related to the so-called verification argument for the value function $V$. This is easier to see in deterministic control -- see the middle of page 11 for the first example Google brought up.

To appreciate the Bellman equation, it might be best to work out a simple example. Roughly speaking, Bellman's optimality principle corresponds to "dynamic programming" in computer science, for which there are many worked examples online -- here's one (the link may be 'insecure', but it looks fine to me).
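As a concrete instance of dynamic programming in the computer-science sense, here is a standard coin-change example (my own illustration, not the one at the link): the Bellman-style recursion $V(a) = 1 + \min_{c \leq a} V(a - c)$, computed bottom-up.

```python
import math

def min_coins(coins, amount):
    """Fewest coins from `coins` summing to `amount` (math.inf if impossible),
    via the Bellman recursion V(a) = 1 + min over coins c <= a of V(a - c)."""
    V = [0] + [math.inf] * amount  # V[a] = optimal value for sub-amount a
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a:
                V[a] = min(V[a], 1 + V[a - c])
    return V[amount]

print(min_coins([1, 5, 12], 16))  # 4 (5 + 5 + 5 + 1; greedy 12 + 1 + 1 + 1 + 1 needs 5)
```

The point is the same optimality principle: the best solution for amount $a$ is built from the best solutions for smaller sub-amounts, never by re-examining whole histories.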

  1. Can my simple (possibly incorrect) example be reconciled with how the proof for the Bellman principle was done?
  2. What exactly does $\sup_{\pi}V^{\pi}(t,x)$ mean and how can it be evaluated? Is it a random variable, where the maximum of $V^{\pi}(t,x)$ is chosen across policies, depending on the state of the world?

For 1, you already spotted the key point in your example:

One issue I see with my example is that the supremum on the outside of the expectation seems to not be able to 'look into the future' while the supremum on the inside of the expectation can. I'm not sure how to reconcile this.

Your control $\pi_{.}$ (the dot signifies that it is a stochastic process) needs to be a stochastic process adapted to the filtration $\mathcal{F}_t$. Your control therefore cannot look into the future: a control that does is not $\mathcal{F}_t$-measurable. The set of all $\mathcal{F}_t$-adapted processes is called the admissible set $\mathcal{U}_{\textrm{ad}}$, and your supremum is taken over processes in that set. I cannot quite pin down why your example fails, but I suspect your $y$ is not adapted correctly; I will need to come back to this later.

For 2, the definition of the value function $V(t,x)$ depends on your problem. If your objective is to maximise the expected value of some terminal cost of your state process, i.e. $\mathbb{E}[f(X_T^{\pi_{.}})]$, then it becomes

$$ V(t,x) = \sup_{\pi_{.} \in \mathcal{U}_{\textrm{ad}}}V^{\pi_{.}}(t,x)$$ $$ V^{\pi_{.}}(t,x) = \mathbb{E}[ f(X_T^{\pi_{.}}) | \mathcal{F}_t] = \mathbb{E}[ f(X_T^{\pi_{.}}) | X_t = x]$$

Let's digest what is happening here.

  • You should read the above equations as: pick an admissible control $\pi_{.}$ and run the state process $X_{.}^{\pi_{.}}$ under it starting from time $t$ and initial value $X_t = x$ up to time $T$. Compute the expected terminal cost $V^{\pi_{.}}(t,x) = \mathbb{E}[f(X_{T}^{\pi_{.}}) | X_t = x ]$. Note the value $V^{\pi_{.}}(t,x)$ down. Now repeat for all other admissible controls and finally take the maximum of all these $V^{\pi_{.}}(t,x)$. For a given pair of inputs $t, x$, this should equal $V(t,x)$.
  • Thus, $V(t,x)$ is a deterministic function of $t$ and $x$. All stochasticity is removed: we have taken an expectation and then a supremum over all possible controls. This "fact" is not obvious; it relies on the process $X$ being Markov and on the cost function being "nice".
  • Conditioning on $\mathcal{F}_t$ is the same as conditioning on $X_t = x$, since your process is Markov: it doesn't care about the past, just the current value, i.e. our starting condition $X_t = x$.
  • The process $X$ depends explicitly on the control $\pi_{.}$, which is why it is denoted as $X_{.}^{\pi_{.}}$
  • Our controls $\pi_{.}$ are all $\mathcal{F}_t$ adapted, so they don't look into the future.
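The enumeration described in the bullets above can be carried out literally on a toy problem. The following sketch is entirely my own illustration (the two-state chain, transition probabilities, and terminal reward are invented numbers): it brute-forces the supremum over all Markov policies and checks it against backward induction, showing that $V(t,x)$ really is a deterministic function of $(t,x)$.

```python
import itertools

# Tiny finite-horizon MDP: two states, two actions, horizon T = 2.
# All numbers here are invented for illustration only.
states, actions, T = (0, 1), (0, 1), 2
P = {  # P[x][a][x'] = probability of moving from x to x' under action a
    0: {0: (0.9, 0.1), 1: (0.4, 0.6)},
    1: {0: (0.2, 0.8), 1: (0.7, 0.3)},
}
f = {0: 1.0, 1: 5.0}  # terminal reward f(X_T)

def value_of_policy(pi, x0):
    """E[f(X_T) | X_0 = x0] under the Markov policy pi, where pi[t][x] is an action."""
    dist = {x: 1.0 if x == x0 else 0.0 for x in states}
    for t in range(T):  # propagate the distribution of X_t forward
        new = {x: 0.0 for x in states}
        for x in states:
            for xp in states:
                new[xp] += dist[x] * P[x][pi[t][x]][xp]
        dist = new
    return sum(dist[x] * f[x] for x in states)

# "Pick a control, compute V^pi, note it down, repeat, take the maximum":
decision_rules = list(itertools.product(actions, repeat=len(states)))
policies = list(itertools.product(decision_rules, repeat=T))
V_brute = {x: max(value_of_policy(pi, x) for pi in policies) for x in states}

# Backward induction (Bellman): V_T = f, then
# V_t(x) = max over a of E[V_{t+1}(X_{t+1}) | X_t = x, action a].
V = dict(f)
for _ in range(T):
    V = {x: max(sum(P[x][a][xp] * V[xp] for xp in states) for a in actions)
         for x in states}

print(V_brute)  # agrees with V up to float rounding
print(V)
```

Both routes give the same deterministic numbers for each starting state, which is the content of the dynamic programming principle on this toy problem.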

Now, for this particular problem and given that $X$ is Markov, we can claim that the Dynamic Programming Principle (Bellman's principle) holds, and thus we can write down an HJB equation for this problem and try to solve it. Solving it should produce an "optimal control" $\pi^{*}_{.}$ under which our state process achieves the maximum objective. If we are successful we can then derive a closed-form expression for $V(t,x)$, and if possible we should verify that this solution is indeed correct, i.e. that the control actually is optimal (this involves several complicated steps which I omit, but you can find them in other sources).