Does there exist an MDP policy with this property?


Consider a discrete-time MDP with finite state and action spaces. For any policy $\pi$ and state $s$, let $u_t^{\pi}(s)$ be the expected total reward from using $\pi$ at times $t, t+1, \dots, N$, given that the process occupies $s$ at time $t$. If $\pi$ is deterministic, it can be shown that $u_t^{\pi}(s) = r_t(s, d_t^{\pi}(s)) + \sum_{j \in S}p_t(j \mid s, d_t^{\pi}(s))\,u_{t+1}^{\pi}(j)$ for every state $s$, where $r_t(s, d_t^{\pi}(s))$ is the immediate reward for choosing the action that $\pi$ prescribes at time $t$ in state $s$, i.e. $d_t^{\pi}(s)$.
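For concreteness, here is a small Python sketch of this backward recursion (illustrative only: the array layout and the terminal convention $u_{N+1} = 0$ are my own assumptions, not taken from any particular text):

```python
import numpy as np

def policy_value(r, p, d):
    """Backward evaluation of a deterministic Markov policy.

    r: (N+1, S, A) array, r[t, s, a] = r_t(s, a)
    p: (N+1, S, A, S) array, p[t, s, a, j] = p_t(j | s, a)
    d: (N+1, S) int array, d[t, s] = d_t(s)
    Returns u of shape (N+2, S) with u[t, s] = u_t(s),
    using the convention u_{N+1} = 0.
    """
    Np1, S, _ = r.shape
    u = np.zeros((Np1 + 1, S))
    for t in range(Np1 - 1, -1, -1):
        a = d[t]  # actions prescribed at time t, one per state
        # u_t(s) = r_t(s, d_t(s)) + sum_j p_t(j | s, d_t(s)) * u_{t+1}(j)
        u[t] = r[t, np.arange(S), a] + p[t, np.arange(S), a] @ u[t + 1]
    return u
```

For example, with one state, one action, reward $1$ at each of times $0,1,2$, this returns $u_0 = 3$, $u_1 = 2$, $u_2 = 1$.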

My question is as follows. Let $\pi$ be a deterministic policy, and let $w = r_t(s, d_t^{\pi}(s)) + \sum_{j \in S}p_t(j \mid s, d_t^{\pi}(s))\,g_j$, where each $g_j$ is in $\cup_{\pi'} \{u_{t+1}^{\pi'}(j)\}$ (the union is over deterministic policies). Can we always construct a policy, say $\sigma$, such that $w = u_t^{\sigma}(s)$? If all the $g_j$'s are associated with the same policy $\pi'$, the answer is yes: let $\sigma$ use $d_t^{\pi}$ at time $t$ and follow $\pi'$ from time $t+1$ on. But what if they are not?
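To illustrate the "same policy" case numerically, the following sketch (all model data random and made up) splices $\pi$'s decision rule at time $t$ onto $\pi'$'s rules for times $t+1, \dots, N$, and checks that the spliced policy attains $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, N = 3, 2, 4  # made-up sizes for the example

r = rng.standard_normal((N + 1, S, A))       # r_t(s, a)
p = rng.random((N + 1, S, A, S))
p /= p.sum(axis=-1, keepdims=True)           # normalize p_t(. | s, a)

def value(d):
    """u_t(s) for the deterministic Markov policy with decision rules d[t]."""
    u = np.zeros((N + 2, S))
    for t in range(N, -1, -1):
        a = d[t]
        u[t] = r[t, np.arange(S), a] + p[t, np.arange(S), a] @ u[t + 1]
    return u

pi = rng.integers(0, A, (N + 1, S))          # policy pi
pi_prime = rng.integers(0, A, (N + 1, S))    # policy pi'

t, s = 1, 0
a = pi[t, s]
g = value(pi_prime)[t + 1]                   # all g_j taken from the same pi'
w = r[t, s, a] + p[t, s, a] @ g

sigma = pi_prime.copy()
sigma[t] = pi[t]                             # follow pi at time t, pi' after
assert np.isclose(value(sigma)[t, s], w)     # w is attained by sigma
```

This works because $\sigma$ agrees with $\pi'$ at all times after $t$, so $u_{t+1}^{\sigma}(j) = u_{t+1}^{\pi'}(j) = g_j$ for every $j$; when the $g_j$ come from different policies, no single set of Markov decision rules for times $t+1, \dots, N$ reproduces all of them simultaneously in this way, which is exactly the difficulty in the question.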

The reason I ask is that an article I am reading tacitly assumes that any quantity of the same form as $w$ must be the expected total reward for using some policy over times $t, \dots, N$.

Thank you.