Proof: $\forall \pi : V^\pi(s) = \max_{a \in A_s} Q^\pi(s,a) \ \forall s \in S \Rightarrow \pi \text{ is optimal}$


This is a homework exercise for a reinforcement (machine) learning course. I would just like to know whether my thoughts on it are correct.

Let $\pi$ be a policy and $V^\pi(s)$ the value function giving the expected return from any state $s$ when following policy $\pi$.

Let us also define the partial ordering of policies: we write $\pi \geq \pi'$, meaning that policy $\pi$ is at least as good as another policy $\pi'$, if

$$V^\pi(s) \geq V^{\pi'}(s) \text{, for every state } s \in S$$

Further, a policy $\pi^*$ is optimal if

$$\pi^* \geq \pi \text{, } \forall \pi \quad \Leftrightarrow \quad V^{\pi^*}(s) = \max_\pi V^\pi(s) \text{, } \forall s \in S$$

where

$$V^{\pi^*}(s) = \max_{a \in A_s}Q^{\pi^*}(s,a)$$

is the Bellman optimality equation for $V^{\pi^*}$, with $Q^{\pi^*}(s, a)$ being the action-value function that returns the value of taking action $a \in A_s$ in a given state $s \in S$ under an optimal policy $\pi^*$.
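The question does not spell out the action-value function. In the standard MDP setting (assuming a transition model $P(s' \mid s,a)$, rewards $R(s,a,s')$, and a discount factor $\gamma$, none of which are given above), one common definition is:

```latex
% Assumed standard definition of the action-value function;
% P, R, and gamma are not defined in the question itself.
Q^{\pi}(s,a) \;=\; \sum_{s' \in S} P(s' \mid s, a)\,
  \Bigl[\, R(s,a,s') + \gamma\, V^{\pi}(s') \,\Bigr]
```

Under this definition, substituting $Q^{\pi}$ into $V^{\pi}(s) = \max_{a \in A_s} Q^{\pi}(s,a)$ recovers the familiar one-step form of the Bellman optimality equation.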

My lecture notes also show, e.g., that there exists at most one function $V^{\pi^*}: S \rightarrow \mathbb{R}$ satisfying the Bellman optimality equations, and then state a corollary:

Every policy $\pi$ for which $V^\pi$ satisfies the Bellman optimality equations $$V^{\pi}(s) = \max_{a \in A_s}Q^{\pi}(s,a)$$ is optimal,

and this is what I have to prove.


In my opinion this corollary clearly follows from the definitions above: saying that such a $\pi$ is not optimal even though it satisfies the Bellman optimality equations would contradict those definitions.

That is why it is not clear to me what exactly I have to prove here. I don't think a fully formal proof is expected, but I suspect that merely pointing out that the claim would otherwise contradict the definitions is not enough.

So how could I prove this corollary? Please don't give full answers; hints should do it!

Accepted answer:

Since you requested only hints:

  1. Assume the opposite.
  2. Write down formally what the opposite means.
  3. Substitute the Bellman optimality equation into the formalized opposite.
  4. Substitute the definition of the action-value function.
  5. Spot the contradiction.
  6. q.e.d.
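As a sketch of what step 2 might look like (my own formalization, not part of the answer): "$\pi$ is not optimal" means there is some policy that strictly beats it in at least one state,

```latex
% Hypothetical formalization of the negated claim:
% pi is not optimal  <=>  some policy pi' does strictly better somewhere.
\exists\, \pi' \;\; \exists\, s \in S : \quad V^{\pi'}(s) \;>\; V^{\pi}(s)
```

From there, unfolding $V^{\pi}(s) = \max_{a \in A_s} Q^{\pi}(s,a)$ and the definition of $Q^{\pi}$ at such a state $s$ is what drives the contradiction in step 5.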