Specifically, I am interested in how the covariance matrix is calculated.
In terms of the dimensions of the factors involved, let's say I am given a data set X of dimension m x d, a covariance array S of dimension d x d x k (one d x d matrix per cluster), and a mean matrix M of dimension k x d, where m is the number of samples, d is the dimensionality of each sample, and k is the number of clusters.
Intuitively it makes sense that the covariance matrix is just the usual covariance of all data points w.r.t. the mean of cluster k, with each point weighted by its likelihood of belonging to cluster k, and the result suitably normalized.
However, I am having trouble converting the summation notation to matrix notation. Specifically, I need to verify that the covariance matrix should be defined as
`(X - Mu)'(X - Mu)`,
which has dimension d x d (with X being m x d, the transpose must come first). Furthermore, I am confused as to how the weight W should be applied to each data vector. If I weight X - Mu and then take the product, wouldn't the weights be applied to X twice? It seems to follow that I should multiply X - Mu by the square root of W, but I need to verify this is true.
Let's say the probability of point $x_j$ being in cluster $k$ is $w_{jk}$. Then the weighted mean is $\mu_k=\sum_{j}w_{jk}x_j/\sum_{j}w_{jk}$.
The covariance matrix then is $S_k=\sum_{j} w_{jk}(x_j-\mu_k)(x_j-\mu_k)^\top/\sum_{j}w_{jk}$.
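As a sanity check, these two summation formulas can be written directly in NumPy (the data and the responsibilities `w` for a single cluster k are hypothetical random values, just to exercise the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3                       # samples, dimensions
X = rng.normal(size=(m, d))       # rows are the data points x_j
w = rng.random(m)                 # w_{jk}: responsibilities for one cluster k

# Weighted mean: mu_k = sum_j w_{jk} x_j / sum_j w_{jk}
mu = (w[:, None] * X).sum(axis=0) / w.sum()

# Weighted covariance: S_k = sum_j w_{jk} (x_j - mu_k)(x_j - mu_k)^T / sum_j w_{jk}
S = sum(w[j] * np.outer(X[j] - mu, X[j] - mu) for j in range(m)) / w.sum()

print(S.shape)  # (3, 3), i.e. d x d
```

Note that the outer product of each centered point with itself is d x d, so the accumulated S is d x d regardless of m.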
Now one can express the sums as matrix products. Put the data points in the columns of $X$ (so $X$ is $d \times m$ here, the transpose of the layout in the question), and let $W_k$ be the diagonal matrix with diagonal entries $w_{jk}/\sum_i w_{ik}$. Then
$$\mu_k = XW_k\mathbf{1} \quad\text{and}\quad S_k=(X-\mu_k\mathbf{1}^\top)W_k(X-\mu_k\mathbf{1}^\top)^\top = XW_kX^\top-\mu_k\mu_k^\top,$$
where $\mathbf{1}$ is the all-ones vector of length $m$. This also settles the square-root question: $W_k$ appears exactly once, between the two centered factors, so with $D = X-\mu_k\mathbf{1}^\top$ one can equivalently write $S_k = (DW_k^{1/2})(DW_k^{1/2})^\top$, and each weight is applied to the data only once.
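A quick numerical check of the matrix identities, again on hypothetical random data (columns of `X` are the points, matching the $d \times m$ convention of the answer):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 6
X = rng.normal(size=(d, m))            # columns are data points
w = rng.random(m)
Wk = np.diag(w / w.sum())              # normalized diagonal weight matrix
one = np.ones(m)

mu = X @ Wk @ one                      # mu_k = X W_k 1
D = X - np.outer(mu, one)              # X - mu_k 1^T
S1 = D @ Wk @ D.T                      # centered, weighted form
S2 = X @ Wk @ X.T - np.outer(mu, mu)   # X W_k X^T - mu_k mu_k^T
A = D @ np.sqrt(Wk)                    # weight once by sqrt(W_k) ...
S3 = A @ A.T                           # ... so A A^T applies W_k exactly once

print(np.allclose(S1, S2), np.allclose(S1, S3))  # True True
```

The `S3` line is exactly the square-root trick from the question: since $W_k^{1/2}(W_k^{1/2})^\top = W_k$, weighting the centered data by the square roots before forming the product applies each weight once.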