I need a definition of neural networks in terms of mathematical mapping syntax. Since neural networks come in all different shapes, I find it a little hard to come up with a definition that covers all the different types. For feedforward networks without cycles, loops, or skip connections, a solid definition could be
$ \mathbb{R}^{n_0} \ni x_0 \mapsto x_N \in \mathbb{R}^{n_N}$ where $x_{k+1} = F_k(x_k)$ and $F_k: \mathbb{R}^{n_k} \to \mathbb{R}^{n_{k+1}} $.
In the case where forward skip connections are admissible, I think the definition would change to
$ \mathbb{R}^{n_0} \ni x_0 \mapsto x_N \in \mathbb{R}^{n_N}$ where $x_{k+1} = F_k(x_0,x_1,\ldots,x_k)$ and $F_k: \mathbb{R}^{\sum_{j=0}^{k} n_j} \to \mathbb{R}^{n_{k+1}}$.
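To make the difference between the two definitions concrete, here is a minimal Python sketch (the layer maps `F` are arbitrary callables standing in for actual layers, not taken from any particular architecture):

```python
import numpy as np

# First definition: x_{k+1} = F_k(x_k), with F_k: R^{n_k} -> R^{n_{k+1}}.
def feedforward(x0, layers):
    x = x0
    for F in layers:
        x = F(x)
    return x

# Second definition: x_{k+1} = F_k(x_0, ..., x_k), with
# F_k: R^{n_0 + ... + n_k} -> R^{n_{k+1}}, realized here by
# concatenating all earlier layer outputs into one input vector.
def feedforward_with_skips(x0, layers):
    xs = [x0]
    for F in layers:
        xs.append(F(np.concatenate(xs)))
    return xs[-1]
```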
In the paper "Neural Ordinary Differential Equations", the authors state:
> Models such as residual networks, recurrent neural network decoders, and normalizing flows build complicated transformations by composing a sequence of transformations to a hidden state: $h_{t+1} = h_t + f(h_t,\theta_t)$
This can obviously only be true for networks in which every layer has the same dimension.
Other papers I have read also claim that residual networks (i.e. networks containing skip connections) can be represented in the above way. My questions are:
- How can it be that recurrent neural networks (i.e. networks containing loops and/or cycles) can be represented in the way from the block quote? Where does the loop happen in the formula?
- Also, how can residual networks with skip connections between non-consecutive layers be written in that way? Where do we see the skip connections in the formula?
- How would I need to change the second definition I gave above in order to fit most network types?
There are many different types of RNNs with different governing equations. In general, they can be written with the update rule $$ (a_t,h_t) = f_r(h_{t-1},x_{t-1}|\theta), $$ of which the rule you quote is simply a "special case" with $f_r(h_t,x_t|\theta)=h_t + f(h_t,x_t|\theta)$ and the per-timestep output $a_t$ ignored. The loop comes from iterating this process over $t=1,\ldots,n$, in other words, by "unfolding" (unrolling) the RNN. The equation you quote is also very general, since you can always write $f(h_t,x_t) = g(h_t,x_t) - h_t$ for an arbitrary update $g$, so you are not very limited.
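As a minimal sketch of this point (the transition function `f` and its weights below are arbitrary stand-ins, not a specific architecture), the "loop" is literally a loop over timesteps, and the residual-style update $h_{t+1} = h_t + f(h_t,x_t)$ is one particular choice of transition:

```python
import numpy as np

def unroll_rnn(h0, inputs, f):
    """Iterate h_{t+1} = h_t + f(h_t, x_t): the 'loop' of the RNN
    is exactly this loop over timesteps (unfolding/unrolling)."""
    h = h0
    for x in inputs:
        h = h + f(h, x)   # residual-style update from the block quote
    return h

# Toy example with a random linear transition (hypothetical weights).
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
f = lambda h, x: np.tanh(W_h @ h + W_x @ x)
h_final = unroll_rnn(np.zeros(4), [rng.normal(size=3) for _ in range(5)], f)
```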
Also, as you said,

> This can obviously only be true for networks in which every layer has the same dimension.

But for RNNs this is typically the case, since $h_t$ is viewed as a "hidden state" that we repeatedly update: given the new data at time $t$, denoted $x_t$, and the old hidden state (memory) $h_t$, we compute $h_{t+1}$.
For residual networks, we typically do something like $$ x_\text{out} = F_\phi(x_\text{in}) + s_\theta(x_\text{in}) = F_\phi(x_\text{in}) + W_\theta x_\text{in}, $$ where $x_\text{in}\in \mathbb{R}^{N_1}$ and $x_\text{out}\in \mathbb{R}^{N_2}$, so that $s_\theta(x) = x$ (or $W_\theta=I$) if $N_1 = N_2$. In other words, when the sizes do not match, the skip connection becomes a linear projection. (See the original ResNet paper for details.)
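A minimal sketch of this in NumPy (the block `F` and the weights here are hypothetical placeholders): the shortcut is the identity when the widths match, and a learned projection $W_\theta$ otherwise.

```python
import numpy as np

def residual_block(x_in, F, W_proj=None):
    """x_out = F(x_in) + s(x_in): the shortcut s is the identity
    if the dimensions match, else a linear projection W_proj."""
    shortcut = x_in if W_proj is None else W_proj @ x_in
    return F(x_in) + shortcut

# Toy example with mismatched widths N1 = 3 -> N2 = 4 (random weights).
rng = np.random.default_rng(0)
W, W_proj = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
x_out = residual_block(np.ones(3), lambda x: np.tanh(W @ x), W_proj)
```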
Let's suppose $N_1 = N_2$, so we have a true skip connection rather than a projection. Then rewrite $x_\text{out} = h_{t+1}$, $x_\text{in} = h_{t}$, and $F_\phi(x_\text{in}) =: f(h_t,\theta_t)$ with $\phi=\theta_t$. This gives us $$ x_\text{out} = h_{t+1} = F_\phi(x_\text{in}) + x_\text{in} = f(h_t,\theta_t) + h_{t}, $$ as desired. Skip connections between non-consecutive layers can be handled by cramming the intervening layers into one operation and "renaming" $t$ so that all of them are performed in a single time step: the skip connection then produces the $h_t$ term, and the operations of the intermediate layers are encapsulated in $f$.
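For example (a small worked case, with $F_k$ and $F_{k+1}$ denoting the two intermediate layer maps), a skip connection over two layers, $x_{k+2} = x_k + F_{k+1}(F_k(x_k))$, folds into the quoted form if we let one "time step" cover both layers: setting $h_t := x_k$, $h_{t+1} := x_{k+2}$, and $f(h_t,\theta_t) := F_{k+1}(F_k(h_t))$ recovers exactly $$ h_{t+1} = h_t + f(h_t,\theta_t). $$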
From what I can tell, your definition is general enough to cover most network types. Basically, you condition the function on the outputs of all the intermediate layers $x_\ell$, yes? If so, then the only thing missing is a layer-specific input. For instance, in many RNNs (e.g., for reinforcement learning), we not only pass along and update a hidden state $h_t$, but we also receive an additional time-specific input, $u_t$ (often denoted $x_t$, but I'm trying to avoid confusing myself), and produce an additional time-specific output, say $a_t$ (like performing an action at each timestep in a game).
However, these are special cases of your formula $x_{k+1} = F_k(x_0,\ldots,x_k)$, since we can simply lump $u_t$ and $a_t$ into $x_t$ and $x_{t+1}$, respectively. We cover RNNs by acting in a Markov manner and ignoring older values, and we cover (non-consecutive) skip connections by virtue of conditioning on all the previous values.
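To tie this together, here is a minimal sketch of the general recursion (the layer maps are hypothetical placeholders): an RNN-style update is recovered by an $F_k$ that only reads the latest value $x_k$, and a non-consecutive skip connection by an $F_k$ that also reads some earlier $x_j$.

```python
import numpy as np

def general_network(x0, layers):
    """x_{k+1} = F_k(x_0, ..., x_k): each layer sees the whole history."""
    xs = [x0]
    for F in layers:
        xs.append(F(xs))          # F may use any subset of the history
    return xs[-1]

# Markov / RNN-style special case: ignore everything but the latest value.
rnn_like = lambda xs: xs[-1] + np.tanh(xs[-1])    # h_{t+1} = h_t + f(h_t)
# Non-consecutive skip: also add the network input x_0 back in.
skip_like = lambda xs: xs[-1] + xs[0]

x_out = general_network(np.ones(4), [rnn_like, rnn_like, skip_like])
```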