Partial-derivative in an artificial neural network. Product of vectors of different length?


Introduction

I am programming an artificial neural network to analyze the MNIST dataset of handwritten digits. The activation vector $\textbf{a}$ of length $\textit{i}$ in layer $\textit{L}$ of the network is given by:

$\textbf{a}^L_{i} = \sigma(\textbf{z}^L_i)$

where vector $\textbf{z}$ is the product of the weight matrix $\textbf{w}$ for layer $\textit{L}$ with the previous layer's activation vector $\textbf{a}^{L-1}_j$ of length $\textit{j}$, plus the bias vector $\textbf{b}^L_i$:

$\textbf{z}^L_{i} = \textbf{w}^L_{ij}\cdot\textbf{a}^{L-1}_{j} + \textbf{b}^L_{i}$

and

$\sigma(\textbf{x}) = \displaystyle\frac{e^{\textbf{x}}}{1+e^{\textbf{x}}}$
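As a concrete illustration, the forward pass for a single layer could be sketched in NumPy as follows (the layer sizes and random values here are purely hypothetical):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = e^x / (1 + e^x), equivalently 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: previous layer has j = 4 units, current layer has i = 3.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # weight matrix w^L_ij, shape (i, j)
b = rng.standard_normal(3)        # bias vector b^L_i, length i
a_prev = rng.standard_normal(4)   # previous activations a^{L-1}_j, length j

z = W @ a_prev + b                # z^L_i, length i
a = sigmoid(z)                    # a^L_i, length i, each entry in (0, 1)
```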

To train the model, the weights and biases must be adjusted to minimize a cost function. The cost function is the squared error between the model's output vector $\textbf{a}^L_{i}$ and the labeled true values $\textbf{y}_i$:

$C(\textbf{a}^L_{i}) = \displaystyle\sum_i(\textbf{a}^L_{i} - \textbf{y}_i)^2$
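For example, with a made-up three-class output and a one-hot label, the cost works out as:

```python
import numpy as np

a_out = np.array([0.1, 0.8, 0.1])   # model output a^L_i (made-up values)
y = np.array([0.0, 1.0, 0.0])       # one-hot true label y_i
C = np.sum((a_out - y) ** 2)        # scalar cost: 0.01 + 0.04 + 0.01 = 0.06
```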

The chain rule is used to calculate the derivative of the cost function with respect to the weights:

$\displaystyle\frac{\partial C}{\partial \textbf{w}^L_{ij}} = \frac{\partial C}{\partial \textbf{a}^L_{i}} \frac{\partial \textbf{a}^L_{i}}{\partial \textbf{z}^L_{i}} \frac{\partial \textbf{z}^L_{i}}{\partial \textbf{w}^L_{ij}}$

and

$\displaystyle \frac{\partial C}{\partial \textbf{a}^L_{i}} = 2\sum_i(\textbf{a}^L_{i} - \textbf{y}_i) $

$\displaystyle \frac{\partial \textbf{a}^L_{i}}{\partial \textbf{z}^L_{i}} = \sigma'(\textbf{z}^L_{i}) = \displaystyle\frac{e^{\textbf{z}^L_{i}}}{(1+e^{\textbf{z}^L_{i}})^2} $

$\displaystyle \frac{\partial \textbf{z}^L_{i}}{\partial \textbf{w}^L_{ij}} = \textbf{a}^{L-1}_{j} $

altogether yielding:

$\displaystyle\frac{\partial C}{\partial \textbf{w}^L_{ij}} = 2\sum_i(\textbf{a}^L_{i} - \textbf{y}_i)\frac{e^{\textbf{z}^L_{i}}}{(1+e^{\textbf{z}^L_{i}})^2}\textbf{a}^{L-1}_{j}$
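Putting the pieces of the expression above together in NumPy, with the scalar sum factored out front and an outer product over the indices $i$ and $j$ so that the result has the same shape as the weight matrix (layer sizes and values again hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
i_size, j_size = 3, 4                      # hypothetical layer widths
W = rng.standard_normal((i_size, j_size))  # w^L_ij
b = rng.standard_normal(i_size)            # b^L_i
a_prev = rng.standard_normal(j_size)       # a^{L-1}_j
y = np.array([0.0, 1.0, 0.0])              # one-hot target y_i

z = W @ a_prev + b
a = sigmoid(z)

scalar = 2.0 * np.sum(a - y)                   # 2 * sum_i (a^L_i - y_i), a scalar
sig_prime = sigmoid(z) * (1.0 - sigmoid(z))    # sigma'(z^L_i), length i
dC_dW = scalar * np.outer(sig_prime, a_prev)   # shape (i, j), same as W
```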

Question

I was going to ask how to reconcile the fact that $\frac{\partial C}{\partial \textbf{w}^L_{ij}}$ includes the product of two vectors of different lengths ($\textbf{a}^L_{i}$ and $\textbf{a}^{L-1}_{j}$).

While typesetting this question, I realized that one of those two quantities isn't a vector at all, but the scalar sum of the differences between two vectors.

$2\sum_i(\textbf{a}^L_{i} - \textbf{y}_T)$ is a scalar value. Selah.
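This can be checked directly: the sum over $i$ collapses the difference vector to a single number (values made up):

```python
import numpy as np

a_out = np.array([0.2, 0.7, 0.3])   # made-up model output a^L_i
y = np.array([0.0, 1.0, 0.0])       # one-hot label y_i
s = 2.0 * np.sum(a_out - y)         # 0-dimensional, i.e. a scalar
```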

Rather than deleting all of this and going on my merry way, I am posting it in the hope that it can help someone else who is learning about neural networks.