In the well-known book Stochastic Optimal Control: The Discrete-Time Case, Bertsekas and Shreve work with universally measurable policies, which come with some handy features:
- they show, for instance, that every such policy can be replaced by an equivalent Markov policy;
- a Borel policy may fail to exist at all, whereas at least one universally measurable policy always exists;
- they guarantee the existence of an everywhere $\varepsilon$-optimal policy, which may not exist in the class of policies that are only Borel measurable.
However, by Lemma 7.28(c) in the very same book, a conditional kernel $q:X\to \mathcal P(X)$ is universally measurable if and only if for every probability measure $p\in \mathcal P(X)$ there exists a Borel measurable kernel $q_p: X\to\mathcal P(X)$ such that the equality $q = q_p$ holds $p$-a.e. on $X$.
I guess this is enough to show that for any initial distribution $\alpha$ and any universally measurable policy $\pi$ there exists a Borel policy $\pi_\alpha$ such that the corresponding probability measures on the path space coincide: $\mathsf P^\pi_\alpha = \mathsf P^{\pi_\alpha}_\alpha$. If this is indeed true, then for any random variable $f:H\to\Bbb R$ on the path space and any initial distribution $\alpha$ it holds that $$ \sup_{\pi:\text{ universally measurable}}\int_H f\;d\mathsf P^\pi_\alpha = \sup_{\pi':\text{ Borel}}\int_H f\;d\mathsf P^{\pi'}_\alpha. $$ Is it true? And if so, what is the main reason for dealing with universally measurable policies?
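To spell out why the equality of suprema would follow, here is a short sketch. It assumes the conjectured replacement property above (for every universally measurable $\pi$ there is a Borel $\pi_\alpha$ with $\mathsf P^\pi_\alpha = \mathsf P^{\pi_\alpha}_\alpha$) and that $f$ is such that all the integrals are well defined, e.g. bounded and Borel measurable; neither assumption is established here. Since every Borel policy is in particular universally measurable, the inequality "$\geq$" is automatic. For the converse, each universally measurable $\pi$ would satisfy $$ \int_H f\;d\mathsf P^\pi_\alpha = \int_H f\;d\mathsf P^{\pi_\alpha}_\alpha \leq \sup_{\pi':\text{ Borel}}\int_H f\;d\mathsf P^{\pi'}_\alpha, $$ and taking the supremum over all universally measurable $\pi$ on the left-hand side would give "$\leq$". Note that $\pi_\alpha$ is allowed to depend on $\alpha$, so this argument only works for one fixed initial distribution at a time.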
Some background and notation for the question above: let $X$ be a Borel state space, let $U$ be a Borel control space, and let $K\subseteq X\times U$ be an analytic set whose sections $K_x = \{u\in U:(x,u)\in K\}$ should be thought of as the controls "available" at the state $x\in X$. Let $\mathcal P(X)$ denote the space of probability measures on $X$ endowed with the topology of weak convergence. Let $t:X\times U\to \mathcal P(X)$ be a Borel transition kernel. We also denote by $H_n = (X\times U)^n\times X$, for $n\in \Bbb N_0\cup\{\infty\}$, the spaces of finite and infinite paths.
A policy $\pi = (\pi_n)_{n\in \Bbb N_0}$ is a sequence of universally measurable kernels $\pi_n:H_n\to \mathcal P(U)$ such that $\pi_n(K_{x_n}|h_n) = 1$ for all $h_n = (x_0,u_0,\dots,x_n)\in H_n$.
The latter condition means that controls can be chosen only among those that are available. For any initial distribution $\alpha\in \mathcal P(X)$ and any policy $\pi$ the probability measure on the path space $H:=H_\infty$ is denoted by $\mathsf P^\pi_\alpha$, and it is the unique measure that satisfies $$ \begin{split} \int_H f\; d \mathsf P^\pi_\alpha = &\int_X\int_U\int_X\dots\int_X\int_U f(x_0,u_0,x_1,\dots,x_{n-1},u_{n-1},x_n,\dots) \\[2mm] &\times t(d x_n|x_{n-1},u_{n-1})\pi_{n-1}(d u_{n-1}|x_0,u_0,\dots,x_{n-1}) \\[2mm] &\times t(d x_{n-1}|x_{n-2},u_{n-2})\cdots t(d x_1|x_0,u_0)\pi_0(d u_0|x_0)\alpha(d x_0), \end{split} $$