This is a homework exercise from a reinforcement learning course that I have to solve. I would just like to know whether my thoughts on it are on the right track.
Let $\pi$ be a policy and $V^\pi(s)$ the value function giving the expected return from any state $s$ under that policy $\pi$.
Let us also define a partial ordering on policies: we write $\pi \geq \pi'$, meaning that policy $\pi$ is better than or equal to another policy $\pi'$, if
$$V^\pi(s) \geq V^{\pi'}(s) \text{, for every state } s \in S$$
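(As I understand it, this is only a partial order: two policies can be incomparable, e.g. when $V^\pi(s_1) > V^{\pi'}(s_1)$ but $V^\pi(s_2) < V^{\pi'}(s_2)$ for two states $s_1, s_2$.)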
Further, a policy $\pi^*$ is optimal if
$$\pi^* \geq \pi \;\;\forall \pi \quad \Leftrightarrow \quad V^{\pi^*}(s) = \max_\pi V^\pi(s) \;\;\forall s \in S,$$
where
$$V^{\pi^*}(s) = \max_{a \in A_s} Q^{\pi^*}(s,a)$$
is the Bellman optimality equation for $V^{\pi^*}$, with $Q^{\pi^*}(s, a)$ the action value function giving the value of taking action $a \in A_s$ in a given state $s \in S$ under an optimal policy $\pi^*$.
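For concreteness, I assume the standard discounted MDP setting (transition probabilities $P(s' \mid s, a)$, rewards $R(s, a, s')$, and discount factor $\gamma$; my notes may use slightly different notation), in which the action value function expands as
$$Q^{\pi}(s, a) = \sum_{s' \in S} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right],$$
so the optimality equation says that $V^{\pi^*}$ is the value of acting greedily with respect to itself for one step.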
My lecture notes also show, for example, that there exists at most one function $V^{\pi^*}: S \rightarrow \mathbb{R}$ satisfying these equations, and then state a corollary:
Every policy $\pi$ for which $V^\pi$ satisfies the Bellman optimality equations $$V^{\pi}(s) = \max_{a \in A_s} Q^{\pi}(s,a)$$ is optimal,
and this is what I have to prove.
In my opinion it is clear that this corollary has to follow from the definitions above: to say that such a $\pi$ is not optimal, even though it satisfies the Bellman optimality equations, would contradict those definitions as far as I can see.
That is why it is not clear to me what there is to prove here. I don't think I have to prove this in a strictly formal way, but I suspect that just pointing out the contradiction with the definitions is not enough.
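The furthest I get on my own is the following chain (just a sketch, assuming the uniqueness result from my notes is the ingredient to invoke):
$$V^\pi \text{ satisfies the optimality equations} \;\Rightarrow\; V^\pi = V^{\pi^*} \;\Rightarrow\; V^\pi(s) = \max_{\pi'} V^{\pi'}(s) \;\;\forall s \in S \;\Rightarrow\; \pi \text{ is optimal},$$
but I am unsure whether justifying each of these implications is what the exercise is after.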
So how could I prove this corollary? Please don't give full answers; hints should do it!