In an MDP (Markov Decision Process), the reward of an action is usually assumed to be known immediately. What if the reward of an action depends on a later state?
For example, consider a system of $m$ servers with exponential service rates $\mu_i,\ i = 1,\dots,m$, facing customers who arrive according to a Poisson process with rate $\lambda$. As the system manager, I must decide which server each arriving customer joins so as to minimize the total average waiting time. Here, a customer's waiting time is not known immediately after I make the 'dispatching' decision.
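To make the setting concrete, here is a minimal event-driven simulation sketch of the dispatch problem. The function names (`simulate_dispatch`, `earliest_free`) and the heuristic policy are my own illustrative choices, not part of the question; the point is that the waiting time incurred by a dispatching decision is only realized later, once the chosen server works through its backlog.

```python
import random

def simulate_dispatch(lam, mus, policy, n_customers=10000, seed=0):
    """Simulate dispatching Poisson arrivals (rate lam) to m servers
    with exponential service rates mus[i]; each server serves its own
    FIFO queue. Returns the average waiting time (delay before service
    starts) over n_customers arrivals."""
    rng = random.Random(seed)
    free_at = [0.0] * len(mus)  # time at which each server next becomes idle
    t, total_wait = 0.0, 0.0
    for _ in range(n_customers):
        t += rng.expovariate(lam)          # next Poisson arrival time
        i = policy(free_at, t, mus)        # the dispatching decision
        total_wait += max(0.0, free_at[i] - t)  # wait realized only later
        service = rng.expovariate(mus[i])  # exponential service time
        free_at[i] = max(free_at[i], t) + service
    return total_wait / n_customers

def earliest_free(free_at, t, mus):
    """A simple heuristic policy: send the arrival to the server that
    will become idle soonest."""
    return min(range(len(free_at)), key=lambda i: free_at[i])
```

Under such a simulation one can compare policies empirically, but the question remains how to cast the problem as an MDP when the per-decision cost (the waiting time) is revealed with delay.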
Can this problem be formulated as an MDP? I have found some papers that discuss similar problems, but their objectives are not quite the same: for example, some minimize the average queue length, while others maximize the total number of completed jobs.