How can I derive this equation from the paper? (Markov chains, Poisson equation)


I am currently reading the paper titled Actor-Critic Algorithms.

I cannot understand the part of the paper that says: 'We are interested in minimizing $\lambda(\theta)$ over all $\theta$. For each $\theta \in \Bbb R^n$, let $V_\theta : S \rightarrow \Bbb R$ be the "differential" cost function, defined as a solution of the Poisson equation:' $$\lambda(\theta) + V_\theta(x) = \sum_{u\in A}\mu_\theta (x, u)\biggl[g(x, u) + \sum_{y}p_{xy}(u)V_\theta(y)\biggr].$$

It seems they define $V_\theta$ via this Poisson equation, but I cannot see where the equation itself comes from.

Below are the conditions and notation they provide.

Consider a Markov decision process with finite state space $S$ and finite action space $A$. Let $g : S \times A \rightarrow \Bbb R $ be a given cost function. A randomized stationary policy (RSP) is a mapping $ \mu $ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $ \Bbb P = \{ \mu_\theta; \theta \in \Bbb R^n \}$, parameterized in terms of a vector $ \theta $. For each pair $(x,u) \in S \times A$, $\mu_\theta(x,u) $ denotes the probability of taking action $u$ when the state $x$ is encountered, under the policy corresponding to $\theta $. Let $ p_{xy}(u) $ denote the probability that the next state is $ y $, given that the current state is $ x $ and the current action is $ u $. Note that under any RSP, the sequence of states $\{X_n\} $ and of state-action pairs $ \{X_n, U_n\} $ of the Markov decision process form Markov chains with state spaces $S$ and $ S \times A $, respectively. We make the following assumptions about the family of policies $ \Bbb P $.

(A1) For all $x \in S$ and $u \in A$, the map $\theta \rightarrow \mu_\theta(x, u) $ is twice differentiable with bounded first and second derivatives. Furthermore, there exists an $ \Bbb R^n $-valued function $\psi_\theta(x, u) $ such that $\nabla \mu_\theta(x,u) = \mu_\theta(x,u)\psi_\theta(x,u) $, where the mapping $\theta \rightarrow \psi_\theta(x, u) $ is bounded and has bounded first derivatives for any fixed $x$ and $u$.

(A2) For each $\theta \in \Bbb R^n $, the Markov chains $\{X_n\}$ and $\{X_n, U_n\}$ are irreducible and aperiodic, with stationary probabilities $\pi_\theta(x)$ and $\eta_\theta(x, u) = \pi_\theta(x) \mu_\theta(x,u) $, respectively, under the RSP $\mu_\theta $.
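To convince myself of the claim in (A2) that $\eta_\theta(x,u) = \pi_\theta(x)\mu_\theta(x,u)$ is stationary for the state-action chain, I checked it numerically on a made-up 2-state, 2-action MDP (the numbers below are hypothetical, not from the paper). The pair chain moves as $P\big((x,u) \to (y,v)\big) = p_{xy}(u)\,\mu_\theta(y,v)$:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, just for a sanity check.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p[u, x, y]: P(next = y | x, u)
              [[0.5, 0.5], [0.7, 0.3]]])
mu = np.array([[0.3, 0.7],                # mu[x, u]: a fixed RSP mu_theta
               [0.6, 0.4]])

# State chain {X_n} has transition matrix P[x, y] = sum_u mu(x,u) p_xy(u);
# its stationary distribution pi is the left eigenvector for eigenvalue 1.
P = np.einsum('xu,uxy->xy', mu, p)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

eta = pi[:, None] * mu                    # eta[x, u] = pi(x) mu(x, u)

# One step of the pair chain {X_n, U_n} applied to eta:
# eta_next[y, v] = sum_{x,u} eta(x,u) p_xy(u) mu(y,v).
eta_next = np.einsum('xu,uxy,yv->yv', eta, p, mu)
assert np.allclose(eta_next, eta)         # eta is indeed stationary
```

The check works because $\sum_{x,u}\pi(x)\mu(x,u)p_{xy}(u) = \pi(y)$ by stationarity of $\pi$ for the state chain.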

In reference to Assumption (A1), note that whenever $\mu_\theta(x, u)$ is nonzero we have $ \psi_\theta(x, u) = \frac{\nabla \mu_\theta(x,u)}{\mu_\theta(x,u)} = \nabla \ln \mu_\theta(x, u) $.
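For concreteness, here is a finite-difference check of the identity $\nabla \mu_\theta(x,u) = \mu_\theta(x,u)\,\psi_\theta(x,u)$ for a softmax-in-features policy (a standard parameterization satisfying (A1), though the paper does not commit to it; the features `phi` are hypothetical). For softmax, $\psi_\theta(x,u) = \phi(x,u) - \sum_a \mu_\theta(x,a)\phi(x,a)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, dim = 3, 2, 4
phi = rng.standard_normal((n_states, n_actions, dim))  # phi[x, u] in R^dim
theta = rng.standard_normal(dim)

def mu(theta, x):
    """Softmax RSP: mu_theta(x, .) proportional to exp(theta @ phi(x, .))."""
    z = np.exp(phi[x] @ theta)
    return z / z.sum()

def psi(theta, x, u):
    """psi_theta(x,u) = grad ln mu_theta(x,u) for the softmax policy."""
    return phi[x, u] - mu(theta, x) @ phi[x]

# Central finite differences for grad_theta mu_theta(x, u).
x, u, eps = 1, 0, 1e-6
grad_fd = np.array([
    (mu(theta + eps * e, x)[u] - mu(theta - eps * e, x)[u]) / (2 * eps)
    for e in np.eye(dim)])

assert np.allclose(grad_fd, mu(theta, x)[u] * psi(theta, x, u), atol=1e-6)
```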

Consider the average cost function $\lambda : \Bbb R^n \rightarrow \Bbb R $ given by

$\lambda(\theta) = \sum_{x \in S, u\in A}g(x, u)\eta_\theta(x,u)$.
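One consistency check I did understand: summing the Poisson equation against $\pi_\theta$ (using stationarity, $\sum_x \pi_\theta(x) P_\theta(x,y) = \pi_\theta(y)$) makes the $V_\theta$ terms cancel and recovers exactly $\lambda(\theta) = \sum_{x,u} g(x,u)\eta_\theta(x,u)$. Below is a numerical sketch on a hypothetical 2-state, 2-action MDP (numbers not from the paper): it computes $\lambda(\theta)$, solves the Poisson equation for $V_\theta$ via the standard fundamental-matrix trick, and verifies the quoted equation holds:

```python
import numpy as np

# Hypothetical MDP, for illustration only.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],   # p[u, x, y]: next-state probs
              [[0.5, 0.5], [0.7, 0.3]]])
g = np.array([[1.0, 4.0],                  # g[x, u]: one-stage cost
              [2.0, 0.5]])
mu = np.array([[0.3, 0.7],                 # mu[x, u]: a fixed RSP mu_theta
               [0.6, 0.4]])

# Under mu_theta: state transition matrix and expected one-stage cost,
# P[x, y] = sum_u mu(x,u) p_xy(u),  gbar[x] = sum_u mu(x,u) g(x,u).
P = np.einsum('xu,uxy->xy', mu, p)
gbar = (mu * g).sum(axis=1)

# Stationary distribution pi_theta of the state chain.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Average cost: lambda(theta) = sum_{x,u} g(x,u) eta(x,u) = pi @ gbar.
lam = pi @ gbar

# Poisson equation in vector form: (I - P) V = gbar - lam * 1.
# (I - P) is singular, so pin down one solution (with pi @ V = 0)
# via the fundamental matrix (I - P + 1 pi^T).
n = P.shape[0]
V = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi),
                    gbar - lam)

# Verify the quoted equation:
# lam + V(x) = sum_u mu(x,u) [ g(x,u) + sum_y p_xy(u) V(y) ].
rhs = (mu * (g + np.einsum('uxy,y->xu', p, V))).sum(axis=1)
assert np.allclose(lam + V, rhs)
```

This confirms a solution exists here, but it does not tell me *why* the paper introduces the equation, which is what I am asking about.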

Can anybody give me more detailed information about how this equation is derived?

Thank you for reading.