I believe I am missing something basic about Jacobian vector products, which is confusing me about some small details of backpropagation...
Basically, the chain rule requires multiplication of derivatives together, and this is what we do in order to backpropagate.
$$ \begin{align} z &= Wx \\ a &= \sigma(z) \\ L &= l(a) \\ \frac{\partial L}{\partial w_{ij}} &= \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w_{ij}} \end{align} $$
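To make the chain rule concrete, here is a minimal numeric sketch for a single entry $w_{ij}$, checked against a finite difference. The specific choices of $\sigma$ (logistic sigmoid) and $l$ (squared error against an arbitrary target $y$) are my own assumptions, not part of the question:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Assumed setup: sigma = sigmoid, l(a) = 0.5 * ||a - y||^2 (hypothetical choices)
rng = np.random.default_rng(0)
n, m = 3, 4
x = rng.normal(size=m)
y = rng.normal(size=n)
W = rng.normal(size=(n, m))

def loss(W):
    a = sigmoid(W @ x)
    return 0.5 * np.sum((a - y) ** 2)

# Chain rule for one entry w_ij (zero-based indices here)
i, j = 1, 2
z = W @ x
a = sigmoid(z)
dL_da = a - y               # ∂L/∂a for the squared-error loss
da_dz = a * (1 - a)         # ∂a/∂z, elementwise (diagonal Jacobian)
dz_dwij = x[j]              # ∂z_i/∂w_ij = x_j
grad_ij = dL_da[i] * da_dz[i] * dz_dwij

# Finite-difference check of the same derivative
eps = 1e-6
Wp = W.copy(); Wp[i, j] += eps
fd = (loss(Wp) - loss(W)) / eps
```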
This only gives the derivative of the loss with respect to a single weight $w_{ij}$. We could get the derivative with respect to the whole weight matrix at once by doing the following...
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial W}$$
but the problem is that the last term has high dimensionality: $\frac{\partial z}{\partial W}$ is a third-order tensor with entries $D_{ijk} = \frac{\partial z_i}{\partial W_{jk}}$. Instead of going through the hassle of constructing this high-dimensional tensor, we just take the individual derivatives with respect to each $w_{ij}$, which (when done in a single operation as an outer product) is what is called the 'Jacobian vector product'. So taking only the last two terms of the partial derivative gives the following...
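One way to see why the big tensor is never needed: $D_{ijk} = \frac{\partial z_i}{\partial W_{jk}} = \delta_{ij}\, x_k$, so contracting it against any upstream gradient $\frac{\partial L}{\partial z}$ collapses to a plain outer product with $x$. A minimal sketch (the stand-in upstream gradient `g` is an arbitrary vector, my assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
x = rng.normal(size=m)

# Full third-order tensor D[i, j, k] = ∂z_i/∂W_jk = δ_ij * x_k
D = np.einsum('ij,k->ijk', np.eye(n), x)

g = rng.normal(size=n)                 # stand-in for the upstream gradient ∂L/∂z
full = np.einsum('i,ijk->jk', g, D)    # contract against the big tensor
outer = np.outer(g, x)                 # same result without ever building D
```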
$$ \begin{bmatrix} \frac{\partial a_1}{\partial z_1} \\ \vdots \\ \frac{\partial a_n}{\partial z_n} \end{bmatrix} \begin{bmatrix} \frac{\partial z_1}{\partial w_{11}} \\ \vdots \\ \frac{\partial z_1}{\partial w_{1m}} \end{bmatrix}^{\top} $$
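This outer product can be checked numerically: its $(i,j)$ entry should be $\frac{\partial a_i}{\partial z_i}\frac{\partial z_i}{\partial w_{ij}} = \sigma'(z_i)\, x_j = \frac{\partial a_i}{\partial w_{ij}}$. A sketch, again assuming $\sigma$ is the logistic sigmoid (my choice):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
n, m = 3, 4
x = rng.normal(size=m)
W = rng.normal(size=(n, m))

z = W @ x
a = sigmoid(z)
da_dz = a * (1 - a)      # column vector: ∂a_i/∂z_i
dz_dw_row = x            # row vector: ∂z_i/∂w_ij = x_j (the same for every i)

# (n, m) matrix with entry (i, j) = ∂a_i/∂w_ij
outer = np.outer(da_dz, dz_dw_row)

# Spot-check one entry against a finite difference of a_i w.r.t. w_ij
i, j = 2, 1
eps = 1e-6
Wp = W.copy(); Wp[i, j] += eps
fd = (sigmoid(Wp @ x)[i] - a[i]) / eps
```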
Is this a correct understanding?