Deterministic Markov Decision Process with reward in each step affected by past actions


I have encountered the MDP problem described in the title. Specifically, the reward at each step depends on the past actions through a memory of length $ L $: the reward is $ R_{t}=s_{t}\,\theta^{\top}(a_{t-L},\dots,a_{t})^{\top}+\epsilon_{t} $, where the state and action are scalars, i.e., $ s,a\in\mathbb{R} $, $ \theta\in\mathbb{R}^{L+1} $, and $ \epsilon_{t} $ is zero-mean noise. The transition is deterministic: for each state-action pair $ (s_{t},a_{t})\in\mathbb{R}^{2} $, there is exactly one next state $ s_{t+1}\in\mathbb{R} $.
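To make the setup concrete, here is a minimal sketch of this reward model in Python. The memory length $L$, the value of $\theta$, and the noise scale are all assumed for illustration, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 3                             # memory length (assumed value for illustration)
theta = rng.normal(size=L + 1)    # reward parameter in R^{L+1} (assumed)

def reward(s_t, action_window, noise_std=0.1):
    """Reward R_t = s_t * theta^T (a_{t-L}, ..., a_t) + eps_t.

    `action_window` holds the last L+1 actions (a_{t-L}, ..., a_t),
    so the reward depends on the action history, not just a_t.
    """
    assert len(action_window) == L + 1
    eps = rng.normal(scale=noise_std)   # zero-mean noise eps_t
    return s_t * (theta @ np.asarray(action_window)) + eps

# Example call: state 1.5 with a window of L+1 past-and-current actions
r = reward(1.5, [0.2, -0.1, 0.4, 0.0])
```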

I plan to solve this MDP by extending the action definition so that at each step the action is $ A_{t} = (a_{t-L},\dots,a_{t})^{\top} $ (equivalently, the last $ L $ actions can be appended to the state, making the augmented process Markov). Value iteration could then be used to extract the optimal policy.
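A rough sketch of this plan, under assumptions not in the question: the continuous state and action spaces are discretized onto small grids, the deterministic transition $f$ is a made-up example, and $L=1$, $\gamma=0.9$, and $\theta$ are hypothetical. The augmented state is $(s_t, a_{t-L},\dots,a_{t-1})$, so choosing $a_t$ completes the reward window:

```python
import numpy as np
from itertools import product

# Toy discretization (assumed): in the question s, a are continuous scalars.
states = np.linspace(-1.0, 1.0, 5)
actions = np.linspace(-1.0, 1.0, 3)
L = 1                            # memory length, kept small for readability
theta = np.array([0.5, 1.0])     # hypothetical reward parameter in R^{L+1}
gamma = 0.9                      # discount factor (assumed)

def f(s, a):
    """Assumed deterministic transition, snapped back onto the state grid."""
    s_next = np.clip(0.5 * s + a, -1.0, 1.0)
    return states[np.abs(states - s_next).argmin()]

def mean_reward(s, window):
    # E[R_t] = s_t * theta^T (a_{t-L}, ..., a_t); the noise is zero-mean,
    # so value iteration can work with the expected reward.
    return s * (theta @ np.asarray(window))

# Augmented state x_t = (s_t, a_{t-L}, ..., a_{t-1}); the chain over x_t
# is Markov, so standard value iteration applies.
aug_states = list(product(states, *([actions] * L)))
V = {x: 0.0 for x in aug_states}

for _ in range(300):             # value iteration to (near) convergence
    V = {
        (s, *h): max(
            mean_reward(s, list(h) + [a])
            + gamma * V[(f(s, a), *(list(h) + [a])[1:])]
            for a in actions
        )
        for (s, *h) in aug_states
    }
```

The price of this construction is that the augmented space grows as $|\mathcal{A}|^{L}$, which is the usual drawback of folding action history into the state.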

Is my plan reasonable? Is there any reference on this kind of special MDP problem? Thanks a lot!