How to prove $span(B^2 v)\le \lambda\gamma span(Bv)$?


Here is a corollary from M. L. Puterman's Markov Decision Processes (1994), stated as follows.

Corollary 6.6.7: For all $v\in V$, $span(B^2 v)\le \lambda\gamma span(Bv)$.

The corollary follows this theorem:

Theorem 6.6.6: Define $\gamma \equiv \max_{s\in S,a\in A_s,s'\in S,a'\in A_{s'}} \left[1-\sum_{j\in S} \min[p(j|s,a),p(j|s',a')] \right]$. Then for any $u,v\in V$, we have $span(Lv-Lu) \le \lambda\gamma span(v-u)$.

Here $span$ denotes the span seminorm, defined by $span(v) = \max_{s\in S} v(s) - \min_{s\in S} v(s)$. The span seminorm has the following properties: $span(v)\ge 0$ for all $v\in V$; $span(v)=span(-v)$; $span(v+ke)=span(v)$; $span(v+u)\le span(v)+span(u)$. Here $k$ is a scalar and $e$ denotes the vector of all ones.
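These properties are easy to sanity-check numerically. A minimal sketch with random test vectors (NumPy assumed; purely an illustration, not part of any proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def span(v):
    """Span seminorm: max(v) - min(v)."""
    return v.max() - v.min()

v, u = rng.standard_normal(5), rng.standard_normal(5)
k = 3.7            # arbitrary scalar
e = np.ones(5)     # all-ones vector

assert span(v) >= 0                                # nonnegativity
assert np.isclose(span(v), span(-v))               # symmetry
assert np.isclose(span(v + k * e), span(v))        # shift invariance
assert span(v + u) <= span(v) + span(u) + 1e-12    # subadditivity
```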

$Bv = \max_{d\in D} \{r_d + (\lambda P_d-I)v\}$, where $r_d$ is the reward function, $\lambda$ is the discount factor, $P_d$ is the transition matrix, $I$ is the identity matrix, $v$ is the value function, and $d$ is a decision rule. $Lv = \max_{d\in D}\{r_d + \lambda P_d v\}$. Clearly, $Bv = Lv-v$.

Puterman writes: "Choosing $u=Lv$ in Theorem 6.6.6 leads to the corollary." But I am confused, since substituting $u=Lv$ into Theorem 6.6.6 does not seem to give the result.

The substitution gives $$span(Lv-L(Lv))\le\lambda\gamma span(v-Lv)=\lambda\gamma span(Bv).$$ However, I cannot show that the left-hand side matches the corollary, i.e., that $span(Lv-L(Lv))=span(B^2v)$.

My attempt: $span(B^2v)=span(L(Bv)-Bv)=span(L(Lv-v)-(Lv-v))=span(L^2v-Lv-Lv+v)\le span(L^2 v-Lv) + span(v-Lv) = span(Lv-L^2 v) + span(Lv-v)$.

Applying Theorem 6.6.6 to the first term then only yields $span(B^2 v) \le \lambda\gamma span(Bv) + span(Bv)$.

So I cannot derive $span(B^2 v) \le \lambda\gamma span(Bv)$ from Theorem 6.6.6. Can anyone tell me how the corollary follows? Or is the corollary incorrect as stated?
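For what it's worth, the quantities involved can be checked numerically on a small random MDP (all numbers below are arbitrary, not from the book). The sketch computes both the literal reading $span(B(Bv))$ and the quantity $span(L^2v - Lv)$ that the substitution $u=Lv$ actually bounds; only the latter is guaranteed by Theorem 6.6.6:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small random MDP: n states, m actions per state (arbitrary toy data).
n, m = 4, 3
lam = 0.9                              # discount factor lambda
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)      # P[s, a] is the row p(.|s, a)
r = rng.random((n, m))                 # rewards r(s, a)

def span(v):
    """Span seminorm: max(v) - min(v)."""
    return v.max() - v.min()

def L(v):
    """(Lv)(s) = max_a [ r(s, a) + lam * sum_j p(j|s, a) v(j) ]."""
    return (r + lam * (P @ v)).max(axis=1)

def B(v):
    """Bv = Lv - v."""
    return L(v) - v

# gamma from Theorem 6.6.6: max over all pairs of state-action rows
# of 1 - sum_j min(p(j|s,a), p(j|s',a')).
rows = P.reshape(n * m, n)
gamma = max(1.0 - np.minimum(rows[i], rows[j]).sum()
            for i in range(n * m) for j in range(n * m))

v = rng.standard_normal(n)
rhs = lam * gamma * span(B(v))
print("span(L^2 v - Lv)       =", span(L(L(v)) - L(v)))  # = span(B(Lv))
print("span(B(Bv)) (literal)  =", span(B(B(v))))
print("lam * gamma * span(Bv) =", rhs)
```

The bound $span(L^2v - Lv) \le \lambda\gamma\, span(Bv)$ must hold by Theorem 6.6.6 with $u=Lv$; comparing the two printed left-hand sides shows whether the literal $B\circ B$ reading behaves the same on this instance.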