I'm having trouble understanding certain steps of this proof, even after consulting the Matrix Cookbook.
For two multivariate Gaussians $P_1, P_2$ on $\mathbb{R}^n$:
$KLD(P_1 || P_2) = E_{P_1}[\log P_1 - \log P_2]$
$= \frac{1}{2} E_{P_1}[-\log \det\Sigma_1 - (x - \mu_1)^T\Sigma_{1}^{-1}(x - \mu_1) + \log\det\Sigma_2 + (x - \mu_2)^T\Sigma_{2}^{-1}(x - \mu_2)]$ (the $(2\pi)^{n/2}$ normalizing constants cancel)
$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} + \frac{1}{2}E_{P_1}[- (x - \mu _1)^T\Sigma_{1}^{-1}(x - \mu_1) + (x - \mu _2)^T\Sigma_{2}^{-1}(x - \mu_2)]$
$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} + \frac{1}{2}E_{P_1}[-tr(\Sigma_{1}^{-1}(x - \mu_1)(x - \mu_1)^T) + tr(\Sigma_{2}^{-1}(x - \mu_2)(x - \mu_2)^T)]$
$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} + \frac{1}{2}E_{P_1}[-tr(\Sigma_{1}^{-1}\Sigma_{1}) + tr(\Sigma_{2}^{-1}(xx^T - 2x\mu_{2}^{T} + \mu_2\mu_{2}^T))]$
Why does $(x-\mu_1)(x-\mu_1)^T = \Sigma_1$?
$= \frac{1}{2}\log \frac{\det\Sigma_2}{\det\Sigma_1} - \frac{1}{2}n + \frac{1}{2} tr(\Sigma_{2}^{-1}(\Sigma_1 + \mu_1\mu_{1}^T - 2\mu_1\mu_{2}^T + \mu_2\mu_{2}^T))$
What rule gets rid of the expected value?
$= \frac{1}{2}\left(\log \frac{\det\Sigma_2}{\det\Sigma_1} - n + tr(\Sigma_{2}^{-1}\Sigma_1) + \mu_{1}^T\Sigma_{2}^{-1}\mu_1 - 2\mu_{1}^T\Sigma_{2}^{-1}\mu_2 + \mu_{2}^T\Sigma_{2}^{-1}\mu_2\right)$
$= \frac{1}{2}\left(\log \frac{\det\Sigma_2}{\det\Sigma_1} - n + tr(\Sigma_{2}^{-1}\Sigma_1) + (\mu_{2}-\mu_1)^T\Sigma_{2}^{-1}(\mu_{2}-\mu_1)\right)$
What rule reduces that last term?
Thanks
It doesn't; what you do have is $$ E_{P_1}[(X_1-\mu_1)(X_1-\mu_1)^T] = \Sigma_1, $$ and this is the definition of the covariance matrix $\Sigma_1$. This gives you the step $$ \begin{align} E_{P_1}\left[\operatorname{Tr}\left(\Sigma_{1}^{-1}(X_1-\mu_1)(X_1-\mu_1)^T \right)\right] &= \operatorname{Tr}\left(\Sigma_{1}^{-1}E_{P_1}\left[(X_1-\mu_1)(X_1-\mu_1)^T \right]\right) = \operatorname{Tr}(\Sigma_{1}^{-1}\Sigma_1). \end{align} $$
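If it helps, this step is easy to check numerically. Here is a small NumPy sketch (the values of `mu1` and `Sigma1` are made up for illustration) showing that the empirical average of $(X_1-\mu_1)(X_1-\mu_1)^T$ converges to $\Sigma_1$, so that inside the expectation the trace becomes $\operatorname{Tr}(\Sigma_1^{-1}\Sigma_1) = \operatorname{Tr}(I_n) = n$:

```python
import numpy as np

# Hypothetical parameters, chosen only to illustrate the identity
rng = np.random.default_rng(0)
mu1 = np.array([1.0, -2.0])
Sigma1 = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

# Sample from P_1 and form the empirical average of (X - mu1)(X - mu1)^T
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
D = X - mu1
emp_cov = D.T @ D / len(D)  # estimate of E[(X - mu1)(X - mu1)^T]

# Inside the expectation, the trace term becomes tr(I_n) = n
emp_trace = np.trace(np.linalg.inv(Sigma1) @ emp_cov)
print(emp_cov)    # close to Sigma1
print(emp_trace)  # close to n = 2
```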
The rule is simply taking the expected value and using the fact that the expectation and trace operators commute. Also recall that $$ \Sigma_{1} = E_{P_1}[X_1 X_1^T] - \mu_1\mu_1^T, $$ then $$ \begin{align} E_{P_1}\left[\operatorname{Tr}\left(\Sigma_{2}^{-1}(X_1X_1^T-2X_1\mu_2^T+\mu_2\mu_2^T)\right) \right] &= \operatorname{Tr}\left(\Sigma_{2}^{-1}E_{P_1}\left[X_1X_1^T - 2X_1\mu_2^T + \mu_2\mu_2^T\right]\right) \\ &=\operatorname{Tr}\left(\Sigma_{2}^{-1}\left[\Sigma_1 + \mu_1\mu_1^T - 2\mu_1\mu_2^T + \mu_2\mu_2^T\right]\right) \end{align} $$ where I have repeatedly used the linearity of expectation.
The rule used at the end is the trace trick, which allows us to write, for instance, $$ \begin{align*} \operatorname{Tr}(\Sigma_{2}^{-1}\mu_1\mu_2^T) &= \operatorname{Tr}(\mu_2^T\Sigma_{2}^{-1}\mu_1) \\ &= \mu_2^T\Sigma_{2}^{-1}\mu_1\\ &= \mu_1^T\Sigma_{2}^{-1}\mu_2 \\ &=\operatorname{Tr}(\mu_1^T\Sigma_{2}^{-1}\mu_2). \end{align*} $$ Combine this with the quadratic expansion $$ (\mu_2-\mu_1)^T\Sigma_{2}^{-1}(\mu_2-\mu_1) = \mu_2^T\Sigma_2^{-1}\mu_2 - 2\mu_2^T\Sigma_{2}^{-1}\mu_1 + \mu_1^T\Sigma_{2}^{-1}\mu_1. $$
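The trace trick itself is easy to verify numerically. In this sketch, `S` is an arbitrary symmetric positive definite matrix standing in for $\Sigma_2^{-1}$ (the specific values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)  # symmetric positive definite stand-in for Sigma_2^{-1}
mu1 = rng.standard_normal(n)
mu2 = rng.standard_normal(n)

# Cyclic property of the trace: tr(S mu1 mu2^T) = mu2^T S mu1
lhs = np.trace(S @ np.outer(mu1, mu2))
mid = mu2 @ S @ mu1
# Symmetry of S then gives mu2^T S mu1 = mu1^T S mu2
rhs = mu1 @ S @ mu2
print(lhs, mid, rhs)  # all three agree
```

Note that the last equality really does need the symmetry of $\Sigma_2^{-1}$; the cyclic property alone only gets you $\mu_2^T\Sigma_2^{-1}\mu_1$.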
That should be all you need to follow the steps involved.
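If you want to double-check the final formula end to end, here is a NumPy sketch comparing the closed form against a Monte Carlo estimate of $E_{P_1}[\log p_1(X) - \log p_2(X)]$ (all parameter values are made up for illustration):

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """Closed-form KL(P1 || P2) between multivariate Gaussians, per the derivation above."""
    n = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  - n + np.trace(S2_inv @ S1) + d @ S2_inv @ d)

def log_pdf(X, mu, S):
    """Multivariate Gaussian log-density, evaluated for each row of X."""
    n = len(mu)
    D = X - mu
    quad = np.einsum('ij,jk,ik->i', D, np.linalg.inv(S), D)
    return -0.5 * (n * np.log(2 * np.pi) + np.log(np.linalg.det(S)) + quad)

rng = np.random.default_rng(2)
mu1, S1 = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 2.0]])
mu2, S2 = np.array([1.0, -1.0]), np.array([[1.5, -0.3], [-0.3, 1.0]])

# Monte Carlo estimate of E_{P1}[log p1(X) - log p2(X)] as an independent check
X = rng.multivariate_normal(mu1, S1, size=500_000)
mc = np.mean(log_pdf(X, mu1, S1) - log_pdf(X, mu2, S2))
print(kl_gauss(mu1, S1, mu2, S2), mc)  # the two agree to a couple of decimals
```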