I have a problem: I want to calculate the Kullback-Leibler (KL) divergence between two datasets, where $X_{1}$ has $M$ features and follows a multivariate normal distribution $\mathcal{N}(\mu_1, \sigma_1)$, and $X_{2}$ has $M$ features and follows a multivariate normal distribution $\mathcal{N}(\mu_2, \sigma_2)$. Both are random samples from one large dataset. I found that someone asked how to calculate the KL divergence between two univariate normal distributions at:
https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians
which gives the KL divergence between two univariate Gaussian distributions.
How can I extend this result to two multivariate normal distributions?
Thank you very much.
If $p\sim\mathcal N(\mu_1,\Sigma_1)$ and $q\sim\mathcal N(\mu_2,\Sigma_2)$ both of dimension $k$ (and both $\Sigma_i$ can be inverted) then
\begin{align*} \log p(x) = -\frac{k}{2} \log(2\pi) -\frac{1}{2}\log(\operatorname{det}(\Sigma_1)) - \frac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1)\\ \log q(x) = -\frac{k}{2} \log(2\pi)-\frac{1}{2}\log(\operatorname{det}(\Sigma_2)) - \frac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \end{align*}
and \begin{align*} \operatorname{KL}(p\|q)&=\mathbb E[\log p(X) - \log q(X)] \end{align*}
where the expectation is taken over $X\sim p$.
Now, using a nice little trick with traces you get \begin{align*} \mathbb E[(X-\mu_1)^T \Sigma_1^{-1} (X-\mu_1)] &=\mathbb E[\operatorname{Tr}((X-\mu_1)^T \Sigma_1^{-1} (X-\mu_1))]\\ &=\mathbb E[\operatorname{Tr}(\Sigma_1^{-1} (X-\mu_1)(X-\mu_1)^T)]\\ &=\operatorname{Tr}(\Sigma_1^{-1}\mathbb E[(X-\mu_1)(X-\mu_1)^T])\\ &=\operatorname{Tr}(\Sigma_1^{-1} \Sigma_1)\\ &=\operatorname{Tr}(I_k)\\ &=k \end{align*}
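The trace trick can be checked numerically. The following is only a sketch: the dimension, the random seed, the sample size, and all variable names are arbitrary choices made for illustration. It uses the fact that the sample average of the quadratic forms equals the trace of $\Sigma_1^{-1}$ times the sample second-moment matrix, and that both approach $k$ as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # illustrative dimension

# Build an arbitrary mean and a symmetric positive-definite covariance.
mu1 = rng.normal(size=k)
B = rng.normal(size=(k, k))
Sigma1 = B @ B.T + k * np.eye(k)
Sigma1_inv = np.linalg.inv(Sigma1)

# Draw samples X ~ N(mu1, Sigma1) and center them at the true mean.
X = rng.multivariate_normal(mu1, Sigma1, size=200_000)
D = X - mu1

# Sample average of (X - mu1)^T Sigma1^{-1} (X - mu1).
quad_mean = np.einsum("ni,ij,nj->n", D, Sigma1_inv, D).mean()

# Tr(Sigma1^{-1} S), with S the sample average of (X - mu1)(X - mu1)^T.
S = D.T @ D / len(D)
trace_form = np.trace(Sigma1_inv @ S)

print(quad_mean, trace_form, k)
```

The first two printed numbers agree up to floating-point error because the trace identity holds sample by sample. Both differ from $k$ only by Monte Carlo noise.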
We further have \begin{align*} &(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \\=& (x-\mu_1+\mu_1-\mu_2)^T \Sigma_2^{-1} (x-\mu_1+\mu_1-\mu_2)\\ =&(x-\mu_1)^T \Sigma_2^{-1} (x-\mu_1) + 2 (\mu_1-\mu_2)^T \Sigma_2^{-1} (x-\mu_1) + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2) \end{align*} Therefore, by the same trick with the trace as before, and because the cross term vanishes in expectation ($\mathbb E[X-\mu_1]=0$ under $p$), we get \begin{align*} &\mathbb E[(X-\mu_2)^T \Sigma_2^{-1} (X-\mu_2)]\\ = &\operatorname{Tr}(\Sigma_2^{-1}\Sigma_1) + 2\cdot 0 + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2)\\ = &\operatorname{Tr}(\Sigma_2^{-1}\Sigma_1) + 2\cdot 0 + \operatorname{Tr}((\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2))\\ = &\operatorname{Tr}(\Sigma_2^{-1}\Sigma_1) + 2\cdot 0 + \operatorname{Tr}(\Sigma_2^{-1} (\mu_1-\mu_2)(\mu_1-\mu_2)^T)\\ = &\operatorname{Tr}(\Sigma_2^{-1}(\Sigma_1+ (\mu_1-\mu_2)(\mu_1-\mu_2)^T)) \end{align*}
Putting things together yields \begin{align*} &\operatorname{KL}(p\|q)\\ =&\mathbb E[-\frac{k}{2} \log(2\pi) -\frac{1}{2}\log(\operatorname{det}(\Sigma_1)) - \frac{1}{2}(X-\mu_1)^T \Sigma_1^{-1} (X-\mu_1)]\\ &-\mathbb E[-\frac{k}{2} \log(2\pi)-\frac{1}{2}\log(\operatorname{det}(\Sigma_2)) - \frac{1}{2}(X-\mu_2)^T \Sigma_2^{-1} (X-\mu_2)]\\ =&\frac{1}{2}\log\left(\frac{\operatorname{det}(\Sigma_2)}{\operatorname{det}(\Sigma_1)}\right) - \frac k2 + \frac12\operatorname{Tr}(\Sigma_2^{-1}(\Sigma_1+ (\mu_1-\mu_2)(\mu_1-\mu_2)^T)) \end{align*}
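As a minimal sketch, the general closed form can be evaluated with NumPy as follows. The function name `kl_mvn` and the example parameters are hypothetical and chosen only for illustration. The code splits the trace into $\operatorname{Tr}(\Sigma_2^{-1}\Sigma_1)$ plus the quadratic form in $\mu_1-\mu_2$, and it assumes both covariance matrices are symmetric positive definite.

```python
import numpy as np

def kl_mvn(mu1, Sigma1, mu2, Sigma2):
    """KL(p || q) for p = N(mu1, Sigma1) and q = N(mu2, Sigma2).

    Hypothetical helper; assumes both covariances are invertible.
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    Sigma1 = np.atleast_2d(np.asarray(Sigma1, dtype=float))
    Sigma2 = np.atleast_2d(np.asarray(Sigma2, dtype=float))
    k = mu1.shape[0]
    diff = mu1 - mu2

    # Tr(Sigma2^{-1} Sigma1), using a linear solve instead of an explicit inverse.
    trace_term = np.trace(np.linalg.solve(Sigma2, Sigma1))
    # (mu1 - mu2)^T Sigma2^{-1} (mu1 - mu2)
    quad_term = diff @ np.linalg.solve(Sigma2, diff)
    # log det via slogdet, which is more stable than log(det(...)).
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)

    return 0.5 * (logdet2 - logdet1 - k + trace_term + quad_term)

# Illustrative parameters (not taken from the question).
mu1 = np.array([0.0, 0.0])
Sigma1 = np.array([[1.0, 0.2], [0.2, 1.0]])
mu2 = np.array([1.0, -1.0])
Sigma2 = np.array([[2.0, 0.0], [0.0, 0.5]])

kl = kl_mvn(mu1, Sigma1, mu2, Sigma2)
print(kl)
```

To apply this to two datasets, one would plug in estimates such as `X.mean(axis=0)` and `np.cov(X, rowvar=False)` for each sample, under the assumption that each dataset is reasonably modeled as multivariate normal.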
In your case $\Sigma_i=\sigma_i^2 I_k$, so $\operatorname{det}(\Sigma_i)=\sigma_i^{2k}$ and $\Sigma_2^{-1}=\sigma_2^{-2} I_k$, therefore \begin{align*} &\operatorname{KL}(p\|q)\\ =&\frac{1}{2}\log\left(\frac{\sigma_2^{2k}}{\sigma_1^{2k}}\right) - \frac k2 + \frac{k\sigma_1^2+\operatorname{Tr}((\mu_1-\mu_2)(\mu_1-\mu_2)^T)}{2\sigma_2^2}\\ =&k\log\left(\frac{\sigma_2}{\sigma_1}\right) - \frac k2 + \frac{k\sigma_1^2+\| \mu_1-\mu_2\|^2}{2\sigma_2^2} \end{align*}
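A short sketch of the isotropic special case, with a hypothetical function name `kl_isotropic`. It relies on $\operatorname{det}(\sigma_i^2 I_k)=\sigma_i^{2k}$, so the log-determinant term contributes $k\log(\sigma_2/\sigma_1)$. Here `sigma1` and `sigma2` are assumed to be standard deviations, not variances.

```python
import numpy as np

def kl_isotropic(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2 I) || N(mu2, sigma2^2 I)).

    Hypothetical helper; sigma1 and sigma2 are positive standard deviations.
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    k = mu1.size
    sq_dist = np.sum((mu1 - mu2) ** 2)  # ||mu1 - mu2||^2
    return (k * np.log(sigma2 / sigma1)
            - k / 2
            + (k * sigma1 ** 2 + sq_dist) / (2 * sigma2 ** 2))

# Illustrative values (not taken from the question).
kl_iso = kl_isotropic([0.0, 0.0, 0.0], 1.0, [1.0, 0.0, -1.0], 2.0)
print(kl_iso)
```

For $k=1$ this reduces to the familiar univariate expression $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac12$.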