Is there always an optimal deterministic policy in MDPs with a continuous state space?


It is well known that in MDPs with a discrete state space, there exists a deterministic policy that is optimal, in the sense that it maximizes the total expected discounted reward from any initial state. I was wondering if this is also true for MDPs with a continuous state space and looking for a reference where this result was derived. Thanks in advance!

On BEST ANSWER

First, I do want to point out that the existence of optimal policies is not guaranteed in discrete problems when the discount factor $\beta = 1$.

Example. Let $\mathbb{X} = \{1,2,\dots\} \cup Z$ be the state space with $Z = \{z_1, z_2\}$ as dead states, and let $\mathbb{A} = \{s,c\}$ be the action space. The dynamics of the system follow the rule \begin{equation} x_{t+1} = f(x_t,a_t) = \begin{cases} z_2 & \text{if $x_t \in Z$,} \\ z_1 & \text{if $x_t \notin Z$ and $a_t = s$,} \\ x_t+1 & \text{if $x_t \notin Z$ and $a_t = c$.} \end{cases} \end{equation} Starting from $x=1,2,\dots$ you either move right, or you go to the first dead state $z_1$, and then proceed to the next dead state $z_2$, where you stay forever.

The cost function \begin{equation} c(x,a) = \begin{cases} x^{-1} & \text{if $x \notin Z$ and $a=s$,} \\ -1 & \text{if $x = z_1$,} \\ 0 & \text{otherwise,} \end{cases} \end{equation} is bounded: $|c(x,a)| \leq 1$. If you "die" at state $x = 1, 2, \dots$ you pay a total of $x^{-1}-1$. Otherwise, you keep moving right at no cost.

Let's calculate the value function \begin{equation} v(x) = \inf_{\pi} v^{\pi}(x), \qquad v^{\pi}(x) := \sum_{t=0}^{\infty} c(x_t,\pi(x_t)), \quad x_0 = x, \end{equation} where the policies $\pi$ are deterministic (relaxing this does not affect the analysis).

Clearly, $-1 \leq v(x) \leq 0$. In fact, by "dying" on or after the state $x=n$, one pays a cost no greater than $n^{-1}-1$. Let $\pi_n$ denote this policy. Then for $x=1,2,\dots$, we obtain $v(x) \leq v^{\pi_n}(x) \leq n^{-1}-1$ for all $n$; hence, $v(x) = -1$ for each $x=1,2,\dots$. But no policy achieves this value: any policy either stops at some finite state $x$, paying $x^{-1}-1 > -1$, or never stops and pays $0$.
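The example can be checked numerically. Here is a minimal sketch (the encoding and all function names are my own, not part of the answer) of the dynamics, the cost, and the policy $\pi_n$; the total cost under $\pi_n$ is exactly $n^{-1}-1$, which approaches but never reaches $-1$:

```python
# States 1, 2, ... are integers; dead states z1, z2 are encoded as strings.
DEAD = ("z1", "z2")

def step(x, a):
    """Dynamics f(x, a): from a live state, 's' stops (-> z1), 'c' continues."""
    if x in DEAD:
        return "z2"
    return "z1" if a == "s" else x + 1

def cost(x, a):
    """Cost c(x, a): pay 1/x to stop, collect -1 once at z1, else 0."""
    if x not in DEAD and a == "s":
        return 1.0 / x
    if x == "z1":
        return -1.0
    return 0.0

def total_cost_pi_n(n, horizon=1000):
    """Total undiscounted cost of pi_n: continue until state n, then stop."""
    x, total = 1, 0.0
    for _ in range(horizon):
        a = "s" if (x not in DEAD and x >= n) else "c"
        total += cost(x, a)
        x = step(x, a)
    return total

for n in (1, 10, 100):
    print(n, total_cost_pi_n(n))  # 1/n - 1: approaches -1 but never attains it
```

Increasing $n$ drives the total cost toward $-1$, while any fixed policy stays strictly above it, matching the argument above.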

Continuous state space models

Very little can be said in continuous state space models without additional assumptions. For example, the value function may not be well-defined unless measurability considerations are taken into account.

Three types of general assumptions are commonly used to ensure that the infinite-horizon value functions are well defined:

Assumption P. The costs $c(x,a)$ are nonnegative for all $x$ and $a$.

Assumption N. The costs are nonpositive for all $x$ and $a$.

Assumption D. The discount factor $\beta \in (0,1)$, and the costs are bounded: $|c(x,a)| < M$ for all $x$ and $a$.

Essentially, these do away with convergence issues in the infinite sum.

There are, broadly speaking, two major directions these assumptions go in: semicontinuous models and Borel models. I could be biased here because this is my research area, but my impression is that most people care about problems that fall into some form of a semicontinuous model. I am not going to touch Borel models here. They rely heavily on arcane topics in measure theory and related concepts, such as analytic sets and Suslin spaces. See the references at the end. Here is a typical presentation of the assumptions in the semicontinuous model:

  1. $\mathbb{X}$ and $\mathbb{A}$ are Borel subsets of Polish (complete, separable) metric spaces;

  2. $c(x,a)$ is lower semicontinuous and bounded below;

  3. One of the following holds:

    1. $\mathbb{A}$ is compact;
    2. $\{(x,a) : c(x,a) \leq \lambda\}$ is compact for all $\lambda \in \mathbb{R}$ ($c$ is called inf-compact);
  4. The transition kernel $P(dx_{t+1}|x_t,a_t)$ is weakly continuous.

Under assumptions like these, the value functions \begin{align} v_n(x) &= \inf_{\pi} v_{n}^\pi(x) := \mathbb{E}_x^\pi \sum_{t=0}^{n-1} \beta^t c(x_t, a_t) \\ v(x) &= \inf_{\pi} v^\pi(x) :=\mathbb{E}_x^\pi \sum_{t=0}^{\infty} \beta^t c(x_t, a_t) \end{align} are well defined, where $\pi$ is a policy and $\mathbb{E}_x^\pi$ denotes the expectation operator for the process starting at $x$ under $\pi$. In the most general form, policies are sequences of stochastic kernels $\pi = (\pi_1, \pi_2, \dots)$ where $\pi_t(da_t|x_0,a_0,x_1,a_1,\dots,x_{t})$ depends on the entire history of the process. However, in many settings it is sufficient to consider some combination of deterministic, Markov, or stationary policies.

Under the above assumptions, we get the following facts:

  • the finite-horizon value functions $v_{n}$ exist, are lower semicontinuous, and satisfy the optimality equations $$ v_{n+1}(x) = \min_{a \in \mathbb{A}} \{c(x,a) + \beta \int_{\mathbb{X}} v_n(y) \ P(dy|x,a) \}. $$
  • the infinite horizon value function $v$ exists and is the pointwise limit of the sequence $\{v_n\}_{n=1}^{\infty}$ through value iterations. It satisfies the optimality equation $$ v(x) = \min_{a \in \mathbb{A}} \{c(x,a) + \beta \int_{\mathbb{X}} v(y) \ P(dy|x,a) \}. $$
  • there are deterministic optimal policies for the finite-horizon problems, and there are deterministic stationary optimal policies for the infinite-horizon problem.
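To make the value-iteration statement concrete, here is a sketch on a toy continuous-state problem of my own devising (not from any reference): $\mathbb{X} = [0,1]$ discretized on a grid, a compact finite action set, deterministic dynamics, and bounded continuous costs with $\beta \in (0,1)$, so Assumption D and the semicontinuity conditions hold. The minimizer in the Bellman backup yields a deterministic stationary policy.

```python
import numpy as np

beta = 0.9
grid = np.linspace(0.0, 1.0, 101)       # discretized state space X = [0, 1]
actions = np.array([-0.1, 0.0, 0.1])    # compact action set A

def backup(v):
    """One Bellman backup: v_{n+1}(x) = min_a { c(x,a) + beta * v(f(x,a)) }."""
    q = np.empty((len(grid), len(actions)))
    for j, a in enumerate(actions):
        nxt = np.clip(grid + a, 0.0, 1.0)   # deterministic dynamics f(x, a)
        c = grid**2 + 0.1 * a**2            # continuous, bounded cost c(x, a)
        q[:, j] = c + beta * np.interp(nxt, grid, v)  # interpolate v off-grid
    return q.min(axis=1), actions[q.argmin(axis=1)]

v = np.zeros_like(grid)                 # v_0 = 0; value iteration v_n -> v
for _ in range(500):
    v_new, policy = backup(v)
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-10:                   # contraction with modulus beta
        break

# `policy` is a deterministic stationary policy read off the optimality equation.
print(v[0], v[-1], policy[-1])
```

Since the Bellman operator here is a $\beta$-contraction, the iterates converge geometrically; near $x = 0$ the optimal action is to stay put (zero cost), while near $x = 1$ the policy steers toward $0$.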

It is possible to tailor some of these assumptions in different directions, but this is the gist.

Some texts you may find useful:

  • Bertsekas, D.P. and Shreve, S.E. Stochastic Optimal Control: The Discrete-Time Case (1996), Athena Scientific.

  • Hernández-Lerma, O. Adaptive Markov Control Processes (1989), Springer.

  • Feinberg, E.A. and Shwartz, A. Handbook of Markov Decision Processes (2002), Springer.

For a recent paper along these lines (including the average cost criterion), see here.