Value function of deep energy-based policies


I am reading the following paper: https://arxiv.org/pdf/1702.08165.pdf

I am stuck with one of their proofs, more precisely equation (14):

$$ H(\pi(\cdot|s)) + \mathbb{E}_{a \sim \pi}[Q_{soft}^{\pi}(s,a)] = -D_{KL}(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)) + \log\int \exp(Q_{soft}^{\pi}(s,a))\, da $$

I set:

$$ \begin{aligned} H(\pi(\cdot|s)) &= -\int \pi(a|s) \log \pi(a|s)\, da \\ \mathbb{E}_{a \sim \pi}[Q_{soft}^{\pi}(s,a)] &= \int \pi(a|s)\, Q_{soft}^{\pi}(s,a)\, da \\ -D_{KL}(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)) &= -\int \pi(a|s) \log\left(\frac{\pi(a|s)}{\tilde{\pi}(a|s)}\right) da \end{aligned} $$

and use the definition $$ \tilde{\pi}(a|s) = \exp(Q_{soft}^{\pi}(s,a)) $$

In the end, I get $$ H(\pi(\cdot|s)) + \mathbb{E}_{a \sim \pi}[Q_{soft}^{\pi}(s,a)] = -D_{KL}(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)) $$
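
Spelled out, the substitution I made looks like this. It is only a sketch of my own steps, and the last line relies on my assumption that $\log \tilde{\pi}(a|s) = Q_{soft}^{\pi}(s,a)$ holds exactly:

$$ \begin{aligned} -D_{KL}(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)) &= -\int \pi(a|s) \log\left(\frac{\pi(a|s)}{\tilde{\pi}(a|s)}\right) da \\ &= -\int \pi(a|s) \log \pi(a|s)\, da + \int \pi(a|s) \log \tilde{\pi}(a|s)\, da \\ &= H(\pi(\cdot|s)) + \int \pi(a|s)\, Q_{soft}^{\pi}(s,a)\, da \end{aligned} $$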

I do not understand where this part comes from:

$$ \log\int \exp(Q_{soft}^{\pi}(s,a))\, da $$

What am I missing?