Functional Description of a Feed Forward Neural Network with 2 Hidden Layers

In the linked paper by Guliyev and Ismailov (paper), they claim at the top of the second page (correctly, I believe) that the output of a feed-forward neural network with two hidden layers, with $k$ units in the first hidden layer and $m$ units in the second, can be expressed as $$ \sum_{i=1}^m e_i \sigma\left( \sum_{j=1}^k c_{ij} \sigma(\mathbf{w}^{ij} \cdot \mathbf{x} - \theta_{ij}) - \xi_i \right) $$ where $\mathbf{x}, \mathbf{w}^{ij} \in \mathbb{R}^d$ ($\mathbf{x}$ is the input to the network).
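To make the indexing concrete, here is a minimal sketch of the expression above as plain Python, assuming $\sigma = \tanh$ (the paper's $\sigma$ is a generic activation; the variable names `w`, `theta`, `c`, `xi`, `e` just mirror the symbols in the formula). Taken literally, the double index on $\mathbf{w}^{ij}$ forces `w` to be an $m \times k \times d$ array — one weight vector per $(i, j)$ pair:

```python
import math

def sigma(t):
    # Example activation; the paper's sigma is generic.
    return math.tanh(t)

def network(x, w, theta, c, xi, e):
    """Literal evaluation of the paper's expression.

    x:     length-d input vector
    w:     m x k x d nested list (one weight vector per (i, j) pair)
    theta: m x k inner biases
    c:     m x k second-layer weights
    xi:    length-m outer biases
    e:     length-m output weights
    """
    m, k, d = len(w), len(w[0]), len(w[0][0])
    out = 0.0
    for i in range(m):
        inner = 0.0
        for j in range(k):
            dot = sum(w[i][j][t] * x[t] for t in range(d))
            inner += c[i][j] * sigma(dot - theta[i][j])
        out += e[i] * sigma(inner - xi[i])
    return out
```

Counting entries in these arrays directly gives $m + m + mk + mk + mkd$ parameters, which is where the discrepancy below comes from.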

My question is: why do the weight vectors $\mathbf{w}^{ij}$ need two indices, instead of being indexed by $j$ alone? When I worked out analytically how many parameters a network described by the above function should have, I got $$ \text{params} = 2m + 2mk + mkd $$ where the $mkd$ term comes from the $\mathbf{w}^{ij}$'s.

This is much larger than what you get when you implement a 2-hidden-layer network in PyTorch and print the number of parameters. It's also much larger than the count I worked out for a network with the described architecture: $$ \text{expected params} = 2m + mk + kd + k $$ which matches PyTorch (if the output layer has no bias).

If anyone can explain what I'm misunderstanding, I'd really appreciate it. Note this is in the context of the universal approximation theorem with bounded depth and bounded width, in case that's useful for understanding the notation.
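For reference, here is how I'm counting in both cases, as a sketch with arbitrary example sizes. The layer shapes mirror what stacked `nn.Linear` layers (input $d \to k \to m \to$ scalar, no output bias) would allocate in PyTorch, but the arithmetic is done by hand so the counts are explicit:

```python
# Arbitrary example sizes for input dim, first and second hidden widths.
d, k, m = 5, 7, 3

# Standard MLP: (weight shape, bias count) per layer, matching
# nn.Linear(d, k), nn.Linear(k, m), nn.Linear(m, 1, bias=False).
layers = [
    ((k, d), k),  # hidden layer 1: k weight vectors in R^d, k biases theta
    ((m, k), m),  # hidden layer 2: c in R^{m x k}, m biases xi
    ((1, m), 0),  # output layer: e in R^m, no bias
]
mlp_params = sum(rows * cols + bias for (rows, cols), bias in layers)
assert mlp_params == 2 * m + m * k + k * d + k  # the "expected params" formula

# Count implied by reading w^{ij} as m*k independent vectors in R^d.
paper_params = 2 * m + 2 * m * k + m * k * d

print(mlp_params, paper_params)  # paper count is much larger
```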