I am struggling to understand how to vectorise a series of summations over several matrices, as it arises in a machine learning algorithm. I already have the final form, but for my own benefit I want to understand how it is derived.
Assuming the following:
$A$ is a $n \times n$ graph adjacency matrix, where $n$ denotes the number of nodes and $A_{ij}$ is the matrix element at $(i,j)$
$B$ is an $n \times c$ matrix, where $n$ is the number of samples (nodes) and $c$ is the number of possible labels, so $B_{ik}$ is the score of label $k$ for node $i$
$C$ is a $c \times c$ matrix; for our purposes here it is the $c \times c$ identity matrix
$D$ is the $n \times n$ diagonal degree matrix of the graph, with $D_{ii} = \sum_{j} A_{ij}$
So the equation is as follows,
$\hat{B} = \sum_{i}\sum_{j}A_{ij}\sum_{k}(B_{ik} - \sum_{l}B_{jl}C_{lk})^{2}$
This allegedly yields the following when the sums are translated into matrix operations:
$\hat{B} = \mathtt{tr}(B^TDB - 2B^TABC + C^TB^TDBC)$
I cannot figure out how to arrive at this formulation. I get confused by the summation over the bracketed term, $\sum_{k}(\cdots)$, and by the inner summation $-\sum_{l}B_{jl}C_{lk}$, which reuses the index $j$ from the leftmost summations.
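To at least convince myself the identity holds before working through the algebra, I put together a small NumPy check (my own sketch, not from any reference) comparing the raw quadruple summation against the trace form. It assumes $B$ is $n \times c$, $C$ is the $c \times c$ identity, and $A$ is symmetric (an undirected graph), since the third trace term seems to rely on row and column degree sums coinciding:

```python
import numpy as np

# Sanity check: raw summation vs. the claimed trace form.
# Assumptions: B is n x c, C is c x c (identity), A is symmetric.
rng = np.random.default_rng(0)
n, c = 5, 3

A = rng.random((n, n))
A = (A + A.T) / 2            # symmetric adjacency matrix (undirected graph)
D = np.diag(A.sum(axis=1))   # diagonal degree matrix, D_ii = sum_j A_ij
B = rng.random((n, c))
C = np.eye(c)                # identity, per the setup

# Left-hand side: the explicit summations, written out index by index
lhs = sum(
    A[i, j] * sum(
        (B[i, k] - sum(B[j, l] * C[l, k] for l in range(c))) ** 2
        for k in range(c)
    )
    for i in range(n) for j in range(n)
)

# Right-hand side: the trace form
rhs = np.trace(B.T @ D @ B - 2 * B.T @ A @ B @ C + C.T @ B.T @ D @ B @ C)

print(np.isclose(lhs, rhs))  # prints True
```

With these dimensions the two sides agree numerically, so the question is really about the algebraic steps in between.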