Understanding mathematics: Deriving the Q-learning equation


I am trying to implement this article. I understand most of the mathematics, but I cannot see where the following identity comes from. It appears in the middle of page 4.

$$v_\pi(s) = \sum_{a \in \mathcal A} \pi(a \mid s) q_{\pi}(s, a)$$

I cannot tell whether it follows from Bayes' theorem or from something else.

Best answer

To unpack that, we need a few other definitions from the paper:

  • $\pi(a \mid s) = \mathbb P \left[ A_t = a \mid S_t = s \right]$
  • $q_\pi(s, a) = \mathbb E_\pi \left[ G_t \mid S_t = s, A_t = a \right]$
  • $v_\pi(s) = \mathbb E_\pi \left[ G_t \mid S_t = s \right]$

The big idea is to decompose across the possible values of $A_t$:

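\begin{align*}
v_{\pi}(s) &= \mathbb E_\pi \left[ G_t \mid S_t = s \right] \\
&= \sum_{a \in \mathcal A} \mathbb E_\pi \left[ G_t \mid S_t = s, A_t = a \right] \cdot \mathbb P(A_t = a \mid S_t = s)
\end{align*}

By the definitions above, the two factors in each summand are $q_\pi(s, a)$ and $\pi(a \mid s)$, which gives the identity in the question. This is the law of total expectation applied conditionally on $S_t = s$, not Bayes' theorem. It is hopefully intuitively reasonable, but it raises the question of why it is justified. Here's a more rigorous justification using indicator variables (which sum to $1$ over $a \in \mathcal A$) and the definition of conditional expectation: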
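As a quick sanity check, the decomposition can be verified numerically for a single fixed state $s$. The sketch below is only an illustration: the action names, the policy probabilities in `pi`, and the return distributions in `returns` are made-up numbers, not taken from the paper. It computes $v_\pi(s)$ once as $\sum_a \pi(a \mid s)\, q_\pi(s, a)$ and once directly from the marginal distribution of $G_t$ given $S_t = s$, using exact fractions so the two results can be compared for equality.

```python
from fractions import Fraction as F

# Hypothetical toy numbers for one fixed state s (not from the paper).
# pi[a] plays the role of pi(a | s).
pi = {"left": F(1, 4), "right": F(3, 4)}

# returns[a][x] plays the role of P(G_t = x | S_t = s, A_t = a).
returns = {
    "left": {0: F(1, 2), 10: F(1, 2)},
    "right": {-2: F(1, 3), 4: F(2, 3)},
}


def q_value(a):
    """q_pi(s, a) = E[G_t | S_t = s, A_t = a] for the toy distributions."""
    return sum(x * p for x, p in returns[a].items())


def v_from_decomposition():
    """v_pi(s) computed as sum_a pi(a | s) * q_pi(s, a)."""
    return sum(pi[a] * q_value(a) for a in pi)


def v_direct():
    """v_pi(s) computed from the marginal pmf of G_t given S_t = s."""
    marginal = {}
    for a, dist in returns.items():
        for x, p in dist.items():
            # P(G_t = x | s) accumulates pi(a | s) * P(G_t = x | s, a) over a.
            marginal[x] = marginal.get(x, F(0)) + pi[a] * p
    return sum(x * p for x, p in marginal.items())


print(v_direct(), v_from_decomposition())  # prints "11/4 11/4"
```

Here $q_\pi(s, \text{left}) = 5$ and $q_\pi(s, \text{right}) = 2$, so both routes give $\tfrac14 \cdot 5 + \tfrac34 \cdot 2 = \tfrac{11}{4}$.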

\begin{align*}
v_{\pi}(s) &= \mathbb E_\pi \left[ G_t \mid S_t = s \right] \\
&= \sum_{a \in \mathcal A} \mathbb E_\pi \left[G_t \cdot \textbf 1_{A_t = a} \mid S_t = s\right] \\
&= \sum_{a \in \mathcal A} \sum_{x} x \cdot \mathbb P \left(G_t = x, A_t = a \mid S_t = s\right) & \bigstar \\
&= \sum_{a \in \mathcal A} \sum_{x} x \cdot \frac{\mathbb P \left(G_t = x, A_t = a, S_t = s \right)}{\mathbb P(S_t = s)}\\
&= \sum_{a \in \mathcal A} \sum_{x} x \cdot \frac{\mathbb P \left(G_t = x, A_t = a, S_t = s \right)}{\mathbb P(A_t = a, S_t = s)} \cdot \frac{\mathbb P(A_t = a, S_t = s)}{\mathbb P(S_t = s)} \\
&= \sum_{a \in \mathcal A} \sum_{x} x \cdot \mathbb P \left(G_t = x \mid A_t = a, S_t = s \right) \cdot \mathbb P \left(A_t = a \mid S_t = s \right) \\
&= \sum_{a \in \mathcal A} \mathbb P \left(A_t = a \mid S_t = s \right) \cdot \left[ \sum_{x} x \cdot \mathbb P \left(G_t = x \mid A_t = a, S_t = s \right) \right] \\
&= \sum_{a \in \mathcal A} \mathbb P \left(A_t = a \mid S_t = s \right) \cdot \mathbb E_\pi \left[ G_t \mid S_t = s, A_t = a \right]
\end{align*}

as desired.

(Note: the equality on line $\bigstar$ is slightly subtle, but the idea is: for $x \neq 0$, $G_t \cdot \textbf 1_{A_t = a} = x$ holds if and only if $G_t = x$ and $A_t = a$. When $x = 0$, the term vanishes from the sum altogether. The sum over $x$ treats $G_t$ as a discrete random variable.)
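The point of the note can also be checked on a small example. The sketch below assumes a made-up joint distribution `joint` of $(A_t, G_t)$ given $S_t = s$ (the numbers and action names are hypothetical). It builds the distribution of $Y = G_t \cdot \textbf 1_{A_t = a}$ explicitly, then compares $\mathbb E[Y \mid S_t = s]$ with $\sum_x x \cdot \mathbb P(G_t = x, A_t = a \mid S_t = s)$. The two distributions differ only in the mass they put on $0$, and that mass is multiplied by $0$ in both sums.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(A_t = a, G_t = x | S_t = s) for one fixed state s.
joint = {
    ("left", 0): F(1, 8),
    ("left", 10): F(1, 8),
    ("right", -2): F(1, 4),
    ("right", 4): F(1, 2),
}


def pmf_of_product(a):
    """pmf of Y = G_t * 1[A_t = a], conditional on S_t = s."""
    pmf = {}
    for (b, x), p in joint.items():
        y = x if b == a else 0  # the indicator zeroes out every other action
        pmf[y] = pmf.get(y, F(0)) + p
    return pmf


def lhs(a):
    """E[G_t * 1[A_t = a] | S_t = s], from the pmf of the product."""
    return sum(y * p for y, p in pmf_of_product(a).items())


def rhs(a):
    """sum_x x * P(G_t = x, A_t = a | S_t = s), the right side of the star line."""
    return sum(x * p for (b, x), p in joint.items() if b == a)


for a in ("left", "right"):
    print(a, lhs(a), rhs(a))
```

For the action `"left"`, the product $Y$ equals $0$ with probability $\tfrac78$, whereas $\mathbb P(G_t = 0, A_t = \text{left} \mid S_t = s) = \tfrac18$; both sides still evaluate to $\tfrac54$ because the $x = 0$ term contributes nothing.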