This is a question I have from stochastic control. I know that in general $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right]\leqslant \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$. I would normally assume that the reverse inequality does not always hold, just as with typical inequalities such as Jensen's. However, in the proof below a similar equality is proven, which makes it seem as if $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right] = \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$ holds, but I don't fully understand/agree with part of the proof.
In the following, the notation $\mathbb E\left[f(X_T)|\mathcal F_t;\pi\right]$ means that the stochastic process $X_t$ is controlled by the process $\pi$, not that $\pi$ is assumed known. The argument is from Markov Decision Processes and Dynamic Programming, at the top of page 9.
In the solution for Bellman's principle (discrete time), I have seen that the following is done, for admissible control processes $\pi$: \begin{equation} \underset{\pi} \sup \mathbb E\left[V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] = \mathbb E\left[\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] \end{equation} which is proven by showing that both inequalities hold. First: \begin{equation} \underset{\pi} \sup \mathbb E\left[V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right]\leqslant \mathbb E\left[\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] \end{equation} which I can see follows from $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right]\leqslant \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$.
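To convince myself of the easy direction, I ran a throwaway numerical sanity check (everything here -- the table `f`, the sizes, the seed -- is made up for illustration):

```python
import random

random.seed(0)

# Hypothetical setup: X takes 5 equally likely values, y has 4 choices,
# and f is a random payoff table -- purely illustrative.
n_states, n_choices = 5, 4
f = [[random.uniform(-10, 10) for _ in range(n_choices)] for _ in range(n_states)]

# sup_y E[f(X, y)]: commit to one y for every state, then average over X.
sup_of_mean = max(sum(f[x][y] for x in range(n_states)) / n_states
                  for y in range(n_choices))

# E[sup_y f(X, y)]: pick the best y separately in each state, then average.
mean_of_sup = sum(max(f[x][y] for y in range(n_choices))
                  for x in range(n_states)) / n_states

assert sup_of_mean <= mean_of_sup  # the "sup E <= E sup" inequality
```

The inner supremum gets to tailor $y$ to each realization of $X$, so it can only do at least as well as a single fixed $y$.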
Then the reverse inequality is proven, where $\pi^* = \underset{\pi}{\operatorname{argmax}}\, V^{\pi}(t+1,X_{t+1})$: \begin{equation} \mathbb E\left[\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] = \mathbb E\left[V^{\pi^*}(t+1,X_{t+1})|X_t=x; \pi^* \right]\leqslant \underset{\pi} \sup \mathbb E\left[V^{\pi}(t+1,X_{t+1})|X_t=x; \pi\right] \end{equation} which follows from the definition of the supremum.
I am wondering why we can't just apply this same procedure to $\underset{y\in \mathcal Y} \sup \mathbb E\left[f(X,y)\right]\leqslant \mathbb E\left[\underset{y\in \mathcal Y} \sup f(X,y)\right]$ and show that equality holds there as well. The issue seems to be the reverse-inequality step, and I can also think of simple examples that seem to defy the equality.
For example: There is just one time step, from $T-1$ to $T$, and the probability space has just $3$ points: $\{ \omega_1,\omega_2,\omega_3\}$, all with equal likelihood. There is a function $f(y,X_T(\omega))$, for $y \in \{0,1\}$, such that whenever $y = 0$, $f(0,X_T(\omega)) = 30$, and that when $y = 1$, $f(1,X_T(\omega_1)) = -3000$, $f(1,X_T(\omega_2)) = -3000$, and $f(1,X_T(\omega_3)) = 300$.
So if we wanted to calculate $\underset{y\in \{0,1\}} \sup \mathbb E\left[f(y,X_T(\omega))\right]$, we maximize the expectation over choices of $y$, so the choice would be $y = 0$, for a value of $30$, regardless of the state of the world. And for $ \mathbb E\left[ \underset{y\in \{0,1\}} \sup f(y,X_T(\omega))\right]$, we choose $y = 1$ when the state of the world is $\omega_3$ and $y = 0$ otherwise, and so since each possibility has equal weighting, the value of the expression is $\frac{300 + 30 + 30}{3} = 120$, which doesn't match the $30$ from $\underset{y\in \{0,1\}} \sup \mathbb E\big[f(y,X_T(\omega))\big]$.
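The two numbers above can be checked mechanically; here is a minimal sketch of the 3-point example, using only the payoffs from the question:

```python
# The 3-point example: states omega_1, omega_2, omega_3, each with probability 1/3;
# f[y][i] is f(y, X_T(omega_i)).
f = {0: [30, 30, 30],
     1: [-3000, -3000, 300]}

# sup_y E[f(y, X_T)]: choose a single y before seeing the state of the world.
sup_of_mean = max(sum(f[y]) / 3 for y in f)

# E[sup_y f(y, X_T)]: choose y after seeing the state of the world.
mean_of_sup = sum(max(f[0][i], f[1][i]) for i in range(3)) / 3

print(sup_of_mean, mean_of_sup)  # 30.0 and 120.0
```

The gap of $120 - 30 = 90$ is exactly the value of being allowed to choose $y$ after observing $\omega$.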
So if we consider the control process to be the choice of $y$, then this seems to contradict the proof of the Bellman principle. One issue I see with my example is that the supremum outside the expectation cannot 'look into the future', while the supremum inside the expectation can. I'm not sure how to reconcile this.
So I think my main questions are:
Can my simple (possibly incorrect) example be reconciled with how the proof for the Bellman principle was done?
What exactly does $\underset{\pi} \sup V^{\pi}(t+1,X_{t+1})$ mean, and how can it be evaluated? Is it a random variable, where the maximum of $V^{\pi}(t+1,X_{t+1})$ over policies is chosen depending on the state of the world?
Any clarification would be appreciated. I've been trying to get this answered for a while now. I'll award the bounty and correct answer even for just a link that'll help explain things! Thanks a lot!
In my view there is no real confusion in your notation, provided you consistently use the semicolon to distinguish the controlling policy from conditioning; but for posterity let's agree that $V^\pi$ is the value function under the policy $\pi$. Under mild conditions the interchange of the supremum and integral is permitted, and the equality does hold -- see here, Proposition 5.1. Of course, it need not hold in general.
You are probably confused because, if that's the case, why isn't the Bellman equation trivial? The reason is that it provides you with an explicit relationship between the process and the optimal control $\pi^\star$. Showing that equality can be achieved in the "$\sup\int \leq \int\sup$" inequality is related to the so-called verification argument for the value function $V$. This is easier to see in deterministic control -- see the middle of page 11 for the first example Google brought up.
To appreciate the Bellman equation, it might be best to work out a simple example. Roughly speaking, Bellman's optimality principle corresponds to "dynamic programming" in computer science, for which there are many worked examples online -- here's one (the link may be 'insecure', but it looks fine to me).
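To make the "simple example" suggestion concrete, here is a minimal backward-induction sketch for a toy finite-horizon MDP. The horizon, states, rewards, and transition kernel are all invented for illustration and are not taken from any of the linked notes:

```python
# A toy finite-horizon MDP solved by backward induction (dynamic programming).
T = 3                      # horizon: decide at t = 0, 1, 2
states = [0, 1]
actions = [0, 1]

def reward(t, x, a):
    # Hypothetical running reward: action 1 pays off only in state 1.
    return 1.0 if (x == 1 and a == 1) else 0.0

def transition(x, a):
    # P(next state | x, a), as a dict state -> probability.
    if a == 0:
        return {x: 1.0}          # action 0: stay put
    return {0: 0.5, 1: 0.5}      # action 1: randomize the state

# V[t][x] = optimal value from time t onward in state x; terminal value is 0.
V = {T: {x: 0.0 for x in states}}
policy = {}
for t in reversed(range(T)):
    V[t] = {}
    for x in states:
        # Bellman: V(t,x) = max_a [ r(t,x,a) + E[ V(t+1, X_{t+1}) | X_t = x; a ] ]
        q = {a: reward(t, x, a)
                + sum(p * V[t + 1][nx] for nx, p in transition(x, a).items())
             for a in actions}
        policy[(t, x)] = max(q, key=q.get)
        V[t][x] = q[policy[(t, x)]]

print(V[0])  # optimal value at time 0 in each starting state
```

Filling in $V(t,x)$ from the terminal time backwards mirrors the Bellman recursion in the question: at each step the supremum over controls is taken inside a conditional expectation given the current state, which is why no "looking into the future" is ever needed.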