I am currently trying to implement back propagation as described in the Wikipedia article.
It defines the gradient of the weights in layer $l$ as: $$\delta^l (a^{l-1})^T$$
where $a^{l}$ is the output of layer $l$.
The article says:
Note that $\delta^l$ is a vector, of length equal to the number of nodes in level $l$; [...]
The number of entries of the vector $a^{l}$ equals the number of nodes in layer $l$. But how can one compute $\delta^l (a^{l-1})^T$ when layer $l-1$ and layer $l$ have different numbers of nodes?
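To make the dimensions concrete, here is a minimal NumPy sketch of what I believe the expression computes, with arbitrarily chosen layer sizes (4 nodes in layer $l-1$, 3 nodes in layer $l$):

```python
import numpy as np

n_prev, n_cur = 4, 3  # arbitrary sizes: layer l-1 has 4 nodes, layer l has 3

delta_l = np.random.randn(n_cur, 1)  # column vector, length = nodes in layer l
a_prev = np.random.randn(n_prev, 1)  # column vector, length = nodes in layer l-1

# delta^l (a^{l-1})^T: a (3, 1) times a (1, 4) matrix product
grad_W = delta_l @ a_prev.T
print(grad_W.shape)  # (3, 4)
```

The result is a $3 \times 4$ matrix, i.e. one entry per weight connecting layer $l-1$ to layer $l$.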