Log probability of following a trajectory under an optimal policy

In reinforcement learning, is the log probability of following a trajectory under an optimal policy equal to the sum of rewards for that trajectory? i.e.

$\log(p(\tau)) = \sum^T_{t=1}r(s_t,a_t)$

I've seen this stated in this blog post: https://dibyaghosh.com/blog/probability/kldivergence.html ("We know that the probability of a trajectory under optimality is exponential in the sum of rewards received on the trajectory. $\log(p(\tau)) = \sum^T_{t=1}r(s_t,a_t)$")

If so, why? It feels like there might be a connection to information theory, since this would mean that maximising the reward minimises the information content (surprisal) of the trajectory, $-\log(p(\tau))$. However, I can't think of a way to prove to myself that this would be the case, and I'm struggling to find supporting references.
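
To make the quoted statement concrete, here is a minimal numerical sketch. It assumes that "exponential in the sum of rewards" means $p(\tau) \propto \exp\left(\sum^T_{t=1}r(s_t,a_t)\right)$ over a small, finite set of trajectories. The trajectory names and reward values are made up for illustration, and `trajectory_log_probs` is a hypothetical helper, not something taken from the blog post. Under that assumption, $\log(p(\tau))$ and the sum of rewards differ by the same constant (the log of the normaliser) for every trajectory, so the equality would hold only up to that additive constant.

```python
import math

# Hypothetical per-step rewards r(s_t, a_t) for three made-up trajectories.
trajectory_rewards = {
    "tau_a": [1.0, 0.5, 0.0],
    "tau_b": [0.0, 0.0, 0.5],
    "tau_c": [-1.0, 0.5, 0.5],
}


def trajectory_log_probs(rewards_by_trajectory):
    """Return (returns, log_probs, log_z), ASSUMING p(tau) is proportional
    to exp(sum of rewards) over the given finite set of trajectories."""
    returns = {name: sum(r) for name, r in rewards_by_trajectory.items()}
    # Log of the normaliser Z = sum over trajectories of exp(return),
    # computed with the usual max-subtraction trick for numerical stability.
    largest = max(returns.values())
    log_z = largest + math.log(
        sum(math.exp(g - largest) for g in returns.values())
    )
    log_probs = {name: g - log_z for name, g in returns.items()}
    return returns, log_probs, log_z


returns, log_probs, log_z = trajectory_log_probs(trajectory_rewards)

for name in returns:
    gap = returns[name] - log_probs[name]
    print(f"{name}: sum of rewards = {returns[name]:+.3f}, "
          f"log p = {log_probs[name]:+.3f}, gap = {gap:.3f}")
print(f"log Z = {log_z:.3f}")
```

In this sketch the printed gap is the same for every trajectory and equals `log_z`, so ranking trajectories by reward and ranking them by log probability agree. It does not explain why trajectories should be distributed this way in the first place, which is the part I'm asking about.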

Thank you for your help!