Let $\triangle \mathcal{S}$ denote the set of probability distributions over a set $\mathcal{S}$. Consider an MDP with a set of states $\mathcal{S}$, set of actions $\mathcal{A}$, reward function $R : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, transition function $T : \mathcal{S} \times \mathcal{A} \rightarrow \triangle \mathcal{S}$, and discount factor $\gamma : [0, 1)$. Let $R_\pi : \mathcal{S} \rightarrow \mathbb{R}$, $T_\pi : \mathcal{S} \rightarrow \triangle \mathcal{S}$, and $V_\pi : \mathcal{S} \rightarrow \mathbb{R}$ denote the reward, transition, and value functions induced by a policy $\pi : \mathcal{S} \rightarrow \mathcal{A}$, where the value function
\begin{align} V_\pi &= R_\pi + \gamma T_\pi R_\pi + \gamma^2 T_\pi^2 R_\pi + \gamma^3 T_\pi^3 R_\pi + \dots \\ &= R_\pi + \gamma T_\pi V_\pi \\ &= (I - \gamma T_\pi)^{-1} R_\pi \end{align}
is the return (expected sum of discounted rewards) of that policy starting from any given state. Consider the problem of finding an optimal policy, i.e.
$$\pi^\ast \in \operatorname*{argmax}_{\pi : \mathcal{S} \rightarrow \mathcal{A}} V_\pi$$
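(For concreteness, the policy-evaluation identity $V_\pi = (I - \gamma T_\pi)^{-1} R_\pi$ can be checked numerically; the 2-state MDP and the function name below are illustrative, not part of the problem statement.)

```python
import numpy as np

def evaluate_policy(T_pi, R_pi, gamma):
    """Compute V_pi = (I - gamma T_pi)^{-1} R_pi for a tabular MDP.

    T_pi: (S, S) row-stochastic matrix, T_pi[s, s'] = T(s, pi(s))(s').
    R_pi: (S,) vector, R_pi[s] = R(s, pi(s)).
    """
    S = T_pi.shape[0]
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

# Hypothetical 2-state example: state 0 pays reward 1, state 1 pays 0.
T_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
R_pi = np.array([1.0, 0.0])
V = evaluate_policy(T_pi, R_pi, gamma=0.95)

# Cross-check against the truncated series R + gamma T R + gamma^2 T^2 R + ...
V_series, term = np.zeros(2), R_pi.copy()
for _ in range(2000):
    V_series += term
    term = 0.95 * (T_pi @ term)
assert np.allclose(V, V_series)
```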
To find such an optimal policy, one can use algorithms like value iteration, policy iteration, or linear programming. Now consider the problem of finding a policy that maximizes the sum of returns over a batch $\mathcal{I}$ of MDPs, i.e.
$$\pi^\ast \in \operatorname*{argmax}_{\pi : \mathcal{S} \rightarrow \mathcal{A}} \sum_{i : \mathcal{I}} (I - \gamma T_\pi^{(i)})^{-1} R_\pi^{(i)} $$
Has this problem been discussed in the literature? Is there an efficient algorithm for finding such an optimal policy? I tried some modifications/extensions of value iteration, but they seem to converge only sometimes, at other times oscillating among a finite set of "solutions". For example, consider the algorithm that iterates \begin{align} Q^{(i)}(s, a) &\gets R^{(i)}(s, a) + \gamma \sum_{s' : \mathcal{S}} T^{(i)}(s, a, s') \, V^{(i)}(s') \\ \pi(s) &\gets \operatorname*{argmax}_{a : \mathcal{A}} \sum_{i : \mathcal{I}} Q^{(i)}(s, a) \\ V^{(i)}(s) &\gets Q^{(i)}(s, \pi(s)) \end{align} where \begin{align} V &: \mathcal{I} \times \mathcal{S} \rightarrow \mathbb{R} \\ Q &: \mathcal{I} \times \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R} \\ \pi &: \mathcal{S} \rightarrow \mathcal{A} \end{align} until convergence in $V$. Note that $|\mathcal{I}| = 1$ recovers ordinary value iteration. In practice this algorithm sometimes converges and sometimes enters a cycle in $V$.
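For reference, here is a direct NumPy transcription of the iteration above (the random-MDP generator, seed, and cycle cutoff are my own, purely to exercise the loop, and are not part of the question):

```python
import numpy as np

def random_mdp(S, A, rng):
    """Hypothetical test MDP: Dirichlet transitions, Gaussian rewards."""
    T = rng.dirichlet(np.ones(S), size=(S, A))  # T[s, a] is a distribution over s'
    R = rng.standard_normal((S, A))
    return T, R

def batch_value_iteration(mdps, gamma, max_iters=1000, tol=1e-10):
    """The iteration from the question: a single greedy policy w.r.t. the
    summed Q-values, with per-MDP value updates."""
    S, A = mdps[0][1].shape
    V = np.zeros((len(mdps), S))
    pi = np.zeros(S, dtype=int)
    for _ in range(max_iters):
        # Q^(i)(s,a) <- R^(i)(s,a) + gamma sum_s' T^(i)(s,a,s') V^(i)(s')
        Q = np.stack([R + gamma * T @ V[i] for i, (T, R) in enumerate(mdps)])
        # pi(s) <- argmax_a sum_i Q^(i)(s,a)
        pi = Q.sum(axis=0).argmax(axis=1)
        # V^(i)(s) <- Q^(i)(s, pi(s))
        V_new = Q[:, np.arange(S), pi]
        if np.allclose(V_new, V, atol=tol):
            return V_new, pi, True   # converged
        V = V_new
    return V, pi, False              # hit max_iters (possibly cycling)

rng = np.random.default_rng(0)
# |I| = 1 recovers ordinary value iteration, which always converges.
V1, pi1, ok1 = batch_value_iteration([random_mdp(4, 3, rng)], gamma=0.9)
# With several MDPs the loop may converge or cycle, as described above.
V2, pi2, ok2 = batch_value_iteration([random_mdp(4, 3, rng) for _ in range(3)], gamma=0.9)
```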