Context: I was recently implementing the Expectation-Maximization (EM) algorithm for Gaussian mixture models in Python, and part of that process involves computing the Gaussian PDF at various points $x_i$ with various parameters. I implemented it somewhat naively, which was fast enough for what I needed, but I feel it could be computed faster if it were written in a nice, clean matrix form. However, I don't know how to get it into that form.
The multivariate Gaussian PDF I need is in the form:
$$ p_k(x_i \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^\frac{d}{2} \vert \Sigma_k \vert^{\frac{1}{2}}} e^{-\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)} $$
Notation:
- $x_i$ is a $d$-dimensional vector of input data, where $1 \le i \le N$ ($N$ is total number of data points)
- $k \in \{1, 2, \ldots, K\}$ indexes the mixture components; I have to calculate $p_k(x_i)$ for each value of $k$
- $\mu_k$ is the $d$-dimensional mean vector of component $k$
- $\Sigma_k$ is the $d \times d$ covariance matrix of component $k$
My main points of confusion:
The equation as I typed above gives a scalar probability for a single $i$ and a single $k$. I need to calculate it for all $i$, $k$, so in the end I'd want an $N \times K$ matrix of values.
The main portion of the exponential part, $(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)$, effectively takes a $1 \times d$ vector, multiplies it by a $d \times d$ matrix, and then multiplies that by a $d \times 1$ vector. With a single data point we get a $1 \times 1$ scalar as a result, but with more than one data point wouldn't we get an $N \times N$ matrix?
Can anyone help me figure out a more elegant way to write this? Again, what I'd ultimately like to end up with is some one-liner to get an $N \times K$ matrix, which I suppose you'd write as $p(X \mid \mu, \Sigma)$, where $X$ is an $N \times d$ matrix, $\mu$ is a $K \times d$ matrix, and $\Sigma$ is a $K \times d \times d$ array.
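For concreteness, my naive version looks roughly like this (a sketch with illustrative names; it just loops over every $(i, k)$ pair and applies the formula above):

```python
import numpy as np

def gaussian_pdf_naive(X, mus, Sigmas):
    """Evaluate p_k(x_i) for every (i, k) pair with plain loops.

    X: (N, d) data, mus: (K, d) means, Sigmas: (K, d, d) covariances.
    Returns an (N, K) array of densities.
    """
    N, d = X.shape
    K = mus.shape[0]
    P = np.empty((N, K))
    for k in range(K):
        Sinv = np.linalg.inv(Sigmas[k])
        # normalizing constant (2*pi)^(d/2) * |Sigma_k|^(1/2)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigmas[k]))
        for i in range(N):
            diff = X[i] - mus[k]
            P[i, k] = np.exp(-0.5 * diff @ Sinv @ diff) / norm
    return P
```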
The result you need is actually the diagonal of the $N\times N$ matrix you get when computing $(X - \mu)\Sigma^{-1} (X - \mu)^T$, where $X$ is $N\times d$, $\mu$ is $1\times d$ and $\Sigma$ is $d\times d$.
One way to get that diagonal without computing the entire $N \times N$ matrix is to sum, for each row, the element-wise products of the rows of $(X - \mu)$ with the corresponding columns of $\Sigma^{-1} (X - \mu)^T$. See this answer.
In Python this gives a one-liner, with `*` the term-wise product and `@` the matrix product.
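For a single component, a NumPy sketch consistent with that description (assuming `X` is $N\times d$, `mu` a length-$d$ vector, and `Sigma` a $d\times d$ covariance matrix):

```python
import numpy as np

def mahalanobis_sq(X, mu, Sigma):
    """Diagonal of (X - mu) Sigma^{-1} (X - mu)^T without forming the N x N matrix."""
    D = X - mu  # (N, d); mu broadcasts across the rows of X
    # row-wise: sum over d of D * (D @ Sigma^{-1}), i.e. the diagonal only
    return np.sum(D * (D @ np.linalg.inv(Sigma)), axis=1)  # shape (N,)
```

Since $\Sigma^{-1}$ is symmetric, `D @ np.linalg.inv(Sigma)` is the transpose of $\Sigma^{-1}(X-\mu)^T$, so the row-wise sum of products matches the rows-times-columns description above.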
Now, you seem to have a $\mu$ of size $K\times d$ and $\Sigma$ of size $K\times d \times d$, which is an additional level of vectorization, and I'm not sure how to deal with that. What shape would $(X - \mu)$ be then? $K\times N \times d$?
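Sketching that extra level with NumPy broadcasting (an assumption on my part, not tested against your EM code): `X[None, :, :] - mus[:, None, :]` does come out $K \times N \times d$, and `np.einsum` can then contract the quadratic form over $d$ for all $(i, k)$ pairs at once, with `np.linalg.inv` applied batch-wise over the leading $K$ axis:

```python
import numpy as np

def mahalanobis_sq_all(X, mus, Sigmas):
    """Quadratic form for every (i, k) pair at once; returns an (N, K) array.

    X: (N, d), mus: (K, d), Sigmas: (K, d, d).
    """
    D = X[None, :, :] - mus[:, None, :]  # (K, N, d) via broadcasting
    Sinv = np.linalg.inv(Sigmas)         # (K, d, d), batched inverse
    # sum_i sum_j D[k,n,i] * Sinv[k,i,j] * D[k,n,j] for each (k, n)
    quad = np.einsum('kni,kij,knj->kn', D, Sinv, D)
    return quad.T                        # (N, K)
```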