I do not understand this notation for the sample covariance matrix (from *Artificial Intelligence: A Modern Approach* by Stuart J. Russell and Peter Norvig, Section 20.3, on the EM algorithm):
$\Sigma_{i} = \frac{\sum_{j}p_{ij}(\mathbf{x}_{j}-\boldsymbol{\mu}_{i})(\mathbf{x}_{j}-\boldsymbol{\mu}_{i})^{\top}}{n_{i}}$
As far as I can tell, the matrix dimensions do not match. From what I understand, $\mathbf{x}_{j}$ and $\boldsymbol{\mu}_{i}$ are row vectors of dimension $1\times d$, in which case $(\mathbf{x}_{j}-\boldsymbol{\mu}_{i})(\mathbf{x}_{j}-\boldsymbol{\mu}_{i})^{\top}$ is a scalar. But isn't $\Sigma_{i}$ supposed to be the $d\times d$ covariance matrix of mixture component $i$? How can this expression yield a $d\times d$ matrix?
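To make my confusion concrete, here is a quick NumPy sketch (with made-up numbers) of the row-vector reading, under which the product collapses to a single number rather than a $d\times d$ matrix:

```python
import numpy as np

# Reading x_j and mu_i as 1 x d ROW vectors (here d = 3, made-up values)
x = np.array([[1.0, 2.0, 3.0]])   # shape (1, 3)
mu = np.array([[0.5, 1.5, 2.5]])  # shape (1, 3)

diff = x - mu                      # shape (1, 3)
product = diff @ diff.T            # (1, 3) @ (3, 1) -> (1, 1), i.e. a scalar
print(product.shape)               # (1, 1)
```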
I also looked up Wikipedia (https://en.wikipedia.org/wiki/Sample_mean_and_covariance). I understand this notation:
$q_{jk}=\frac{1}{N-1}\sum_{i=1}^{N}(x_{ij}-\overline{x_j})(x_{ik}-\overline{x_{k}})$
for elements of the sample covariance matrix (Q). But again, not this one:
$Q=\frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_{i.}-\overline{\mathbf{x}})(\mathbf{x}_{i.}-\overline{\mathbf{x}})^{\top}$
What am I missing here?
Unless explicitly stated otherwise, vectors are generally assumed to be $d \times 1$ column vectors, not $1 \times d$ row vectors.
So, in both of the expressions you didn't understand (the first and the third), we have something of the form
\begin{equation*} (\mathbf{a} - \mathbf{b})(\mathbf{a} - \mathbf{b})^\top, \end{equation*}
where $\mathbf{a}$ and $\mathbf{b}$ (and therefore also "$(\mathbf{a} - \mathbf{b})$") are $d \times 1$ column vectors. Naturally, the expression $(\mathbf{a} - \mathbf{b})^\top$ then gives a $1 \times d$ row vector due to the transpose. Multiplying the two in that order then gives a $d \times d$ matrix.
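Here is a short NumPy sketch (made-up numbers) confirming the dimensions under the column-vector convention, and checking the Wikipedia outer-product formula for $Q$ against NumPy's own sample covariance:

```python
import numpy as np

d = 3
a = np.array([[1.0], [2.0], [3.0]])   # d x 1 COLUMN vector
b = np.array([[0.5], [1.5], [2.5]])   # d x 1 COLUMN vector

diff = a - b                # d x 1
outer = diff @ diff.T       # (d, 1) @ (1, d) -> (d, d) matrix
print(outer.shape)          # (3, 3)

# Sanity check of the outer-product formula for Q against np.cov
# (rows of X are the N observations x_i, made-up random data):
rng = np.random.default_rng(0)
X = rng.normal(size=(10, d))          # N = 10 observations in d = 3 dimensions
xbar = X.mean(axis=0)                 # sample mean vector
Q = sum(np.outer(x - xbar, x - xbar) for x in X) / (X.shape[0] - 1)
assert np.allclose(Q, np.cov(X, rowvar=False))
```

Note that `np.cov(X, rowvar=False)` treats each row of `X` as one observation and divides by $N-1$ by default, which is exactly the Wikipedia formula.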