I'm reading the paper on Xavier initialization ("Understanding the Difficulty of Training Deep Feedforward Neural Networks", Glorot and Bengio, AISTATS 2010) and have a question about the derivation of one of its equations. I've been able to find material covering other equations in the paper, but nothing on these two in particular.
The specific part of the paper states:
For a dense artificial neural network using symmetric activation function $f$ with unit derivative at $0$ (i.e. $f'(0) = 1$), if we write $\mathbf{z}^i$ for the activation vector of layer $i$, and $\mathbf{s}^i$ the argument vector of the activation function at layer $i$, we have:
$$\begin{align}\mathbf{s}^i & = \mathbf{z}^i W^i + \mathbf{b}^i \\ \mathbf{z}^{i + 1} & = f\left(\mathbf{s}^i\right)\end{align}$$
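To make sure I'm parsing the notation correctly, here's a tiny NumPy sketch of those forward definitions (the layer sizes and random values are my own choice; note that $\mathbf{z}^i$ is a row vector here, matching the paper's $\mathbf{z}^i W^i$ ordering):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3     # toy layer sizes, my own choice

f = np.tanh            # a symmetric activation with f'(0) = 1

z_i = rng.standard_normal((1, n_in))       # activation row vector z^i
W_i = rng.standard_normal((n_in, n_out))   # weight matrix W^i
b_i = np.zeros((1, n_out))                 # bias vector b^i

s_i = z_i @ W_i + b_i  # s^i = z^i W^i + b^i
z_next = f(s_i)        # z^{i+1} = f(s^i)

print(s_i.shape, z_next.shape)  # (1, 3) (1, 3)
```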
From these definitions we obtain the following:
$$\begin{align}\dfrac{\partial\text{Cost}}{\partial s_k^i} & = f'(s_k^i)\, W^{i+1}_{k, \bullet} \dfrac{\partial \text{Cost}}{\partial \mathbf{s}^{i + 1}} \\ \dfrac{\partial \text{Cost}}{\partial w^i_{l, k}} & = z^i_l \dfrac{\partial \text{Cost}}{\partial s_k^i}\end{align}$$
The equations with the partial derivatives are the two that I'm having trouble understanding. I've studied backpropagation before but I'm having a bit of trouble wrapping my head around how elements from the next layer (i.e. $W^{i + 1}$ and $\mathbf{s}^{i + 1}$) ended up in the equation for calculating the partial derivatives for layer $i$.
Is there something that I'm missing? I know this isn't exactly a machine learning Stack Exchange, but I was hoping there'd be someone familiar with the concept who could provide some pointers. Thanks!