Understanding an equation used in the Xavier initialization for deep neural networks paper


I'm trying to understand an equation from the paper "Understanding the difficulty of training deep feedforward neural networks", available here. The equation in question is on the bottom left of page 5.

The authors write (they will later assume $f$ acts linearly at initialization):

For a dense artificial neural network using a symmetric activation function $f$ with unit derivative at $0$, if we write $z^i$ for the activation vector of layer $i$ and $s^i$ for the argument vector of the activation function at layer $i$, we have $s^i = z^i W^i + b^i$ and $z^{i+1} = f(s^i)$. From these definitions we obtain the following:

$ \frac{\partial Cost} {\partial w^{i}_{l,k}} = z^{i}_{l} \frac{\partial Cost}{\partial s^{i}_{k}}$ , and $\frac{\partial Cost}{\partial s^{i}_{k}} = f'(s^{i}_{k}) W^{i+1}_{k, :}\frac{\partial Cost}{\partial s^{i+1}}$.

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at the initialization, that the weights are initialized independently and that the input features variances are the same $( = Var[x])$. Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$f'(s^{i}_{k}) \approx 1$ and $Var[z^i] = Var[x] \prod_{i'=0}^{i-1} n_{i'} Var[W^{i'}]$.

From what I understand, this last statement follows from the assumption that $E[x] = E[W^i] = 0$, together with the formula that, for two independent random variables $X, Y$, $Var(XY) = Var(X)Var(Y) + E(X)^2Var(Y) + E(Y)^2Var(X)$.
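To check my understanding, here is my own derivation for one layer, using the independence and zero-mean assumptions (and $b^i = 0$ at initialization): since $s^i_k = \sum_{l} z^i_l W^i_{l,k}$ is a sum of $n_i$ independent zero-mean products,

\begin{equation*} Var[s^i_k] = \sum_{l=1}^{n_i} Var[z^i_l]\, Var[W^i_{l,k}] = n_i \, Var[z^i] \, Var[W^i], \end{equation*}

and in the linear regime $z^{i+1} = f(s^i) \approx s^i$, so $Var[z^{i+1}] = n_i Var[W^i] Var[z^i]$. Unrolling this recursion down to $z^0 = x$ gives the stated product.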

Next the authors write:

We write $Var[W^{i'}]$ for the shared scalar variance of all weights at layer $i'$. Then for a network with $d$ layers,

\begin{equation} Var\left[ \frac{\partial Cost}{\partial s^i}\right] = Var\left[\frac{\partial Cost}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1} Var [W^{i'}] \tag{1}\end{equation}

\begin{equation} Var\left[\frac{\partial Cost}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'} Var [W^{i'}] \prod_{i'=i}^{d-1} n_{i'+1} Var [W^{i'}] \, Var[x] \, Var \left[\frac{\partial Cost}{\partial s^d}\right] \tag{2} \end{equation}

I'm trying to understand equations $(1)$ and $(2)$ tagged above. For equation $(1)$: since $\frac{\partial Cost}{\partial s^i} = \frac{\partial Cost}{\partial s^{i+1}} \frac{\partial s^{i+1}}{\partial s^i}$, and in the linear regime $s^{i+1} = f(s^i) W^{i+1} + b^{i+1} \approx s^i W^{i+1} + b^{i+1}$, we have $\frac{\partial s^{i+1}}{\partial s^i} \approx W^{i+1}$. Since each backward step through $W^{i'}$ multiplies the variance by $n_{i'+1} Var[W^{i'}]$ (again using the independence and zero-mean assumptions), I think the result follows by induction from $d$ down to $i$.
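As a sanity check on $(1)$, here is a quick simulation of my own (not from the paper), assuming all layers share the same width $n$ and we are in the linear regime, so back-propagating the gradient is just repeated multiplication by weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300            # common layer width (all n_i equal, for simplicity)
var_w = 1.0 / n    # shared weight variance Var[W^i]
d = 5              # number of layers the gradient is propagated through

emp = []
for _ in range(50):
    g = rng.normal(0.0, 1.0, size=n)        # stand-in for dCost/ds^d, Var = 1
    for _ in range(d):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        g = W @ g                           # one linear back-prop step
    emp.append(g.var())

predicted = (n * var_w) ** d                # eq (1) with equal widths
print(np.mean(emp), predicted)
```

With $n \, Var[W] = 1$ both the empirical and predicted variances come out close to $1$, which matches the paper's point that this product controls whether gradients shrink or grow with depth.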

I'm not sure how to derive equation $(2)$; any insights appreciated.
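My partial attempt for $(2)$: since $\frac{\partial Cost}{\partial w^i_{l,k}} = z^i_l \frac{\partial Cost}{\partial s^i_k}$, and the two factors should be (approximately) independent with zero mean at initialization, the variances would multiply:

\begin{equation*} Var\left[\frac{\partial Cost}{\partial w^i}\right] = Var[z^i] \, Var\left[\frac{\partial Cost}{\partial s^i}\right] = \left(Var[x] \prod_{i'=0}^{i-1} n_{i'} Var[W^{i'}]\right) \left(Var\left[\frac{\partial Cost}{\partial s^d}\right] \prod_{i'=i}^{d-1} n_{i'+1} Var[W^{i'}]\right), \end{equation*}

i.e. the forward variance formula times equation $(1)$. But I'm unsure why the backward product would stop at $d-1$ here while it runs up to $d$ in $(1)$.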