Introduction
I am programming an artificial neural network to analyze the MNIST dataset of handwritten digits. The activation vector $\textbf{a}$ in layer $L$ of the network, with components indexed by $i$, is given by:
$\textbf{a}^L_{i} = \sigma(\textbf{z}^L_i)$
where the vector $\textbf{z}$ is the product of the weight matrix $\textbf{w}^L_{ij}$ for layer $L$ with the previous layer's activation vector $\textbf{a}^{L-1}_{j}$ (components indexed by $j$), plus the bias vector $\textbf{b}^L_{i}$:
$\textbf{z}^L_{i} = \textbf{w}^L_{ij}\cdot\textbf{a}^{L-1}_{j} + \textbf{b}^L_{i}$
and
$\sigma(\textbf{x}) = \displaystyle\frac{e^{\textbf{x}}}{1+e^{\textbf{x}}}$
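To make the forward pass concrete, here is a minimal pure-Python sketch of these two equations for a single layer. The dimensions and weights are toy values I chose for illustration, not real MNIST sizes:

```python
import math

def sigmoid(x):
    # sigma(x) = e^x / (1 + e^x), equivalently 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))

def forward(W, a_prev, b):
    # z_i = sum_j W[i][j] * a_prev[j] + b[i]   (weighted input for layer L)
    z = [sum(w * a for w, a in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    # a_i = sigma(z_i)                         (activation for layer L)
    return [sigmoid(z_i) for z_i in z], z

# Toy layer: 2 inputs feeding 3 neurons (arbitrary numbers).
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]
b = [0.0, 0.1, -0.1]
a_prev = [1.0, 0.5]
a, z = forward(W, a_prev, b)
```

Each row of `W` holds the incoming weights of one neuron in layer $L$, so the comprehension computes one dot product per neuron.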
To train the model, the weights and biases must be adjusted to minimize a cost function. The cost function used here is the squared error between the model's output vector $\textbf{a}^L_{i}$ and the labeled true values $\textbf{y}_i$:
$C(\textbf{a}^L_{i}) = \displaystyle\sum_i(\textbf{a}^L_{i} - \textbf{y}_i)^2$
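As written, $C$ sums over the output components and produces a single number. A small sketch with toy values (three classes instead of the ten digits):

```python
def cost(a, y):
    # C = sum_i (a_i - y_i)^2
    return sum((a_i - y_i) ** 2 for a_i, y_i in zip(a, y))

# Toy network output vs. a one-hot label for the middle class.
a_out = [0.2, 0.9, 0.1]
y = [0.0, 1.0, 0.0]
C = cost(a_out, y)  # 0.04 + 0.01 + 0.01 = 0.06
```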
The chain rule is used to compute the derivative of the cost function with respect to the weights:
$\displaystyle\frac{\delta C}{\delta \textbf{w}^L_{ij}} = \frac{\delta C}{\delta \textbf{a}^L_{i}} \frac{\delta \textbf{a}^L_{i}}{\delta \textbf{z}^L_{i}} \frac{\delta \textbf{z}^L_{i}}{\delta \textbf{w}^L_{ij}}$
and
$\displaystyle \frac{\delta C}{\delta \textbf{a}^L_{i}} = 2\sum_i(\textbf{a}^L_{i} - \textbf{y}_i) $
$\displaystyle \frac{\delta \textbf{a}^L_{i}}{\delta \textbf{z}^L_{i}} = \sigma'(\textbf{z}^L_{i}) = \displaystyle\frac{e^{\textbf{z}^L_{i}}}{(1+e^{\textbf{z}^L_{i}})^2} $
$\displaystyle \frac{\delta \textbf{z}^L_{i}}{\delta \textbf{w}^L_{ij}} = \textbf{a}^{L-1}_{j} $
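The middle factor is easy to sanity-check numerically: the closed form $e^z/(1+e^z)^2$ should match a centered finite difference of $\sigma$, and also the equivalent form $\sigma(z)(1-\sigma(z))$. A quick sketch (the test point is arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # e^x / (1 + e^x)^2 -- algebraically equal to sigmoid(x) * (1 - sigmoid(x))
    return math.exp(x) / (1.0 + math.exp(x)) ** 2

z = 0.7  # arbitrary test point
analytic = sigmoid_prime(z)
h = 1e-6
# Centered finite difference: (sigma(z+h) - sigma(z-h)) / 2h
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
```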
altogether yielding:
$\displaystyle\frac{\delta C}{\delta \textbf{w}^L_{ij}} = 2\sum_i(\textbf{a}^L_{i} - \textbf{y}_i)\frac{e^{\textbf{z}^L_{i}}}{(1+e^{\textbf{z}^L_{i}})^2}\textbf{a}^{L-1}_{j}$
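Putting the pieces together, here is a sketch that evaluates this expression exactly as written: the summed error term collapses to a single scalar (the observation in the Question below), so the result has one entry per weight $w_{ij}$ and the same shape as the weight matrix. Dimensions and values are toy choices of my own:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # e^x / (1 + e^x)^2
    return math.exp(x) / (1.0 + math.exp(x)) ** 2

# Toy layer: 2 inputs -> 3 outputs (arbitrary numbers).
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b = [0.0, 0.1, -0.1]
a_prev = [1.0, 0.5]
y = [0.0, 1.0, 0.0]

# Forward pass.
z = [sum(w * a for w, a in zip(row, a_prev)) + b_i for row, b_i in zip(W, b)]
a_out = [sigmoid(z_i) for z_i in z]

# The error factor, summed over i as in the formula above, is one scalar.
error = 2.0 * sum(a_i - y_i for a_i, y_i in zip(a_out, y))

# dC/dw_ij = error * sigma'(z_i) * a_prev[j] -- same shape as W.
grad = [[error * sigmoid_prime(z_i) * a_j for a_j in a_prev] for z_i in z]
```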
Question
I was going to ask how to reconcile the fact that $\frac{\delta C}{\delta \textbf{w}^L_{ij}}$ includes the product of two vectors of different lengths ($\textbf{a}^L_{i}$ and $\textbf{a}^{L-1}_{j}$).
While typesetting this question, I realized that one of those two factors isn't a vector at all, but rather the scalar sum of the differences between two vectors:
$2\sum_i(\textbf{a}^L_{i} - \textbf{y}_i)$ is a scalar value. Selah.
Rather than deleting all of this and going on my merry way, I am posting it in the hope that it can help someone else who is learning about neural networks.