Mathematics behind Neural Networks


I wasn't sure whether to ask this here, but based on the related questions, I think it's appropriate.

For starters, I'm trying to build a neural network using the website below as a reference. It seems to implement a perceptron with a single hidden layer.

https://causeyourestuck.io/2017/06/12/neural-network-scratch-theory/

Following the guide, it makes sense, but I'm having trouble understanding the mathematics behind the backpropagation part. I understand that you have to use the error function to derive the rate of change for the biases and weights, but I'm confused as to how the derivatives (w.r.t. the parameters) end up being a 'scalar' multiplication. The derivatives have to be the same size as the parameters, but how do you get to that point (using ${\partial J\over\partial B_2}$ as the example here)? Also, when deriving ${\partial J\over\partial W_2}$ using the chain rule, how does ${\partial Y\over\partial W_2}$ end up involving the transpose of $H$, and why does it end up being a dot product with the derivative of the error function (w.r.t. the result)?

Sorry, my background in matrix algebra is not especially strong, so having the mathematics explained to me would really help a lot.

Best Answer

The author seems to have conflated the terminology for several kinds of products of matrices, all of which are involved in the article:

  1. ordinary matrix product of matrices with compatible dimensions,
  2. scalar (or dot) product of row or column vectors with identical dimensions,
  3. Hadamard (or entrywise) product of matrices with identical dimensions, and
  4. outer product of vectors with possibly different dimensions.
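To make the distinction concrete, here is a minimal NumPy sketch of all four products; the array names and shapes are my own illustrative choices, not taken from the article:

```python
import numpy as np

A = np.arange(6).reshape(2, 3)         # 2x3 matrix
M = np.arange(12).reshape(3, 4)        # 3x4 matrix
u = np.array([1.0, 2.0, 3.0])          # length-3 vector
v = np.array([4.0, 5.0, 6.0])          # length-3 vector
w = np.array([7.0, 8.0])               # length-2 vector

print(A @ M)           # 1. ordinary matrix product: (2x3)(3x4) -> 2x4
print(u @ v)           # 2. scalar (dot) product of equal-length vectors -> a single number
print(u * v)           # 3. Hadamard (entrywise) product -> length-3 vector
print(np.outer(w, u))  # 4. outer product: (2x1)(1x3) -> 2x3 matrix
```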

We're given the scalar quantity $J$ defined as half of a scalar (dot) product $$J= {1\over 2}(Y-Y^*)(Y-Y^*)^\mathsf{T}={1\over 2}\sum_i(Y_i-Y_i^*)^2$$ where (suppressing unneeded subscripts) $$Y=f(HW+B)$$ with $HW$ the ordinary matrix product of $H$ and $W$, the various matrices having the following dimensions: $$\begin{align} Y&:1\times y\\ H&:1\times h\\ W&:h\times y\\ B&:1\times y. \end{align}$$ Here $f(M)$ denotes the matrix whose $i$th element is just $f$ applied to the $i$th element of matrix $M$ (i.e., $[f(M)]_i = f(M_i)$, using $[...]_i$ to mean "the $i$th element of $...$"); so the $i$th element of $Y$ is $$\begin{align} Y_i&=f([HW+B]_i)\\ &=f([HW]_i+B_i)\\ &=f(\sum_jH_jW_{ji}+B_i).\end{align}$$
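In code, the forward pass looks like the following sketch; the sizes $h$, $y$, the choice of sigmoid for $f$, and all variable names are illustrative assumptions, not fixed by the article:

```python
import numpy as np

rng = np.random.default_rng(0)
h, y = 4, 3                               # illustrative sizes

H = rng.standard_normal((1, h))           # hidden activations, 1 x h
W = rng.standard_normal((h, y))           # weights, h x y
B = rng.standard_normal((1, y))           # biases, 1 x y
Y_star = rng.standard_normal((1, y))      # targets Y*, 1 x y

def f(z):                                 # f applied entrywise (sigmoid assumed)
    return 1.0 / (1.0 + np.exp(-z))

Y = f(H @ W + B)                          # 1 x y
J = 0.5 * np.sum((Y - Y_star) ** 2)       # scalar loss
```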

Now we find ${\partial J\over\partial B}$ and ${\partial J\over\partial W}$, which are matrices with dimensions $1\times y$ and $h\times y$, respectively; thus: $$\left[{\partial J\over\partial B}\right]_k={\partial J\over\partial B_k}=\sum_i{\partial J\over\partial Y_i}{\partial Y_i\over\partial B_k} $$ where $$\begin{align}{\partial J\over\partial Y_i} &={\partial \over \partial Y_i} {1\over 2}\sum_j(Y_j-Y_j^*)^2\\ &={1\over 2}\sum_j2(Y_j-Y_j^*)\delta_{ij}\\ &=Y_i-Y_i^*\\ &=[Y-Y^*]_i \end{align}$$ and $$\begin{align}{\partial Y_i\over\partial B_k} &={\partial \over\partial B_k}f([HW]_i+B_i)\\ &=f'([HW]_i+B_i)\,\delta_{ik}\\ &=[f'(HW+B)]_i\,\delta_{ik}, \end{align}$$ giving $$\begin{align}\left[{\partial J\over\partial B}\right]_k &=\sum_i [Y-Y^*]_i [f'(HW+B)]_i\,\delta_{ik}\\ &=[Y-Y^*]_k[f'(HW+B)]_k\\ &=[(Y-Y^*)* f'(HW+B)]_k, \end{align}$$ and therefore $${\partial J\over\partial B} =(Y-Y^*)* f'(HW+B), $$ where $*$ denotes the Hadamard ("entrywise") product of the two matrices having identical dimensions $(1\times y)$.
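As a sanity check, continuing the illustrative sketch above (with the sigmoid assumed for $f$, so $f'(z)=f(z)(1-f(z))$), the formula for ${\partial J\over\partial B}$ can be verified against finite differences:

```python
def fprime(z):                             # sigmoid derivative: f'(z) = f(z)(1 - f(z))
    return f(z) * (1.0 - f(z))

dJ_dB = (Y - Y_star) * fprime(H @ W + B)   # Hadamard product, shape 1 x y

# finite-difference check of each component of dJ/dB
eps = 1e-6
numeric = np.zeros_like(B)
for k in range(y):
    Bp = B.copy()
    Bp[0, k] += eps                        # perturb one bias component
    Jp = 0.5 * np.sum((f(H @ W + Bp) - Y_star) ** 2)
    numeric[0, k] = (Jp - J) / eps
print(np.max(np.abs(dJ_dB - numeric)))     # tiny, e.g. ~1e-7
```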

Similarly, $$\left[{\partial J\over\partial W}\right]_{kl}={\partial J\over\partial W_{kl}}=\sum_i{\partial J\over\partial Y_i}{\partial Y_i\over\partial W_{kl}} $$ where $$\begin{align}{\partial Y_i\over\partial W_{kl}} &={\partial \over\partial W_{kl}}f([HW]_i+B_i)\\ &=f'([HW]_i+B_i) {\partial \over\partial W_{kl}}[HW]_i\\ &=f'([HW]_i+B_i) {\partial \over\partial W_{kl}}\sum_jH_jW_{ji}\\ &=f'([HW]_i+B_i) \sum_jH_j\delta_{jk}\delta_{il}\\ &=[f'(HW+B)]_i\,H_k\,\delta_{il}, \end{align}$$ giving $$\begin{align}\left[{\partial J\over\partial W}\right]_{kl} &=\sum_i [Y-Y^*]_i\,[f'(HW+B)]_i\,H_k\,\delta_{il}\\ &=H_k\,[Y-Y^*]_l\,[f'(HW+B)]_l\\ &=[H^\mathsf{T} ((Y-Y^*)* f'(HW+B))]_{kl}, \end{align}$$ and therefore $$\begin{align}{\partial J\over\partial W} &=H^\mathsf{T}\,((Y-Y^*)* f'(HW+B)), \end{align}$$ where the transpose is necessary for the dimensions to be compatible under ordinary matrix multiplication; that is, $H^\mathsf{T}$ is $h\times 1$ and $(Y-Y^*)* f'(HW+B)$ is $1\times y$, producing an $h\times y$ result (an outer product).
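The same finite-difference check works for ${\partial J\over\partial W}$, again continuing the sketch above; note how `H.T @ delta` is exactly the $(h\times 1)(1\times y)$ outer product described:

```python
delta = (Y - Y_star) * fprime(H @ W + B)   # 1 x y
dJ_dW = H.T @ delta                        # (h x 1)(1 x y) -> h x y, the outer product

# finite-difference check of each component of dJ/dW
eps = 1e-6
numeric = np.zeros_like(W)
for k in range(h):
    for l in range(y):
        Wp = W.copy()
        Wp[k, l] += eps                    # perturb one weight
        Jp = 0.5 * np.sum((f(H @ Wp + B) - Y_star) ** 2)
        numeric[k, l] = (Jp - J) / eps
print(np.max(np.abs(dJ_dW - numeric)))     # tiny, e.g. ~1e-6
```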