I tried to work through the math behind the initialization method of Kaiming He et al. in this paper.
Ignoring the bias, and assuming the $n$ products $wx$ are i.i.d., we can get the variance of $y$ as:
$$\mathrm{Var}[y] = n \mathrm{Var}[wx] $$
To get the variance of $wx$, I use this formula (valid when $w$ and $x$ are independent):
$$\mathrm{Var}[wx] = E[w^2]E[x^2] - [E[x]]^2[E[w]]^2$$
Let $w$ have zero mean, so I am left with:
$$\mathrm{Var}[wx] = E[w^2]E[x^2]$$
Since $E[w] = 0$, the formula for variance gives $E[w^2] = \mathrm{Var}[w]$, so I can replace $E[w^2]$:
$$\mathrm{Var}[wx] = \mathrm{Var}[w]E[x^2]$$
Then I substitute this back into the first equation:
$$\mathrm{Var}[y] = n \mathrm{Var}[w]E[x^2] $$
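A quick Monte Carlo sketch can sanity-check the relation $\mathrm{Var}[y] = n \mathrm{Var}[w]E[x^2]$ numerically. It assumes Gaussian zero-mean weights and non-negative inputs drawn as the positive part of a standard normal; these distributions, the function name `check_layer_variance`, and its parameters are illustrative choices, not taken from the paper.

```python
import random


def check_layer_variance(n=8, trials=20_000, sigma_w=0.7, seed=0):
    """Compare the sample variance of y = sum_i w_i * x_i with n * Var[w] * E[x^2]."""
    rng = random.Random(seed)
    ys = []
    sum_x_sq = 0.0
    for _trial in range(trials):
        y = 0.0
        for _i in range(n):
            w = rng.gauss(0.0, sigma_w)        # zero-mean weight (assumed Gaussian)
            x = max(0.0, rng.gauss(0.0, 1.0))  # non-negative input, so E[x] != 0
            y += w * x
            sum_x_sq += x * x
        ys.append(y)
    mean_y = sum(ys) / trials
    var_y = sum((y - mean_y) ** 2 for y in ys) / trials
    e_x_sq = sum_x_sq / (trials * n)           # empirical E[x^2]
    predicted = n * sigma_w ** 2 * e_x_sq      # n * Var[w] * E[x^2]
    return var_y, predicted


var_y, predicted = check_layer_variance()
print(var_y, predicted)
```

The two printed numbers should agree up to sampling noise, even though $x$ does not have zero mean; only $w$ needs zero mean for the cross term to vanish.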
From here, I am not sure how they get the following:
$$\mathrm{Var}[y] = \frac12n \mathrm{Var}[w] \mathrm{Var}[y_{l-1}] $$
Can anyone explain this to me? Thank you.
Under the paper's assumptions (weights symmetric around zero and zero bias), $y_{l-1}$ has zero mean and a distribution that is symmetric around zero. Therefore:
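\begin{align}
\mathrm{Var}(y_{l-1}) &= \mathbb{E}[y_{l-1}^2] \\
&= \mathbb{E}[y_{l-1}^2|y_{l-1} > 0]\mathrm{Pr}[y_{l-1} > 0] + \mathbb{E}[y_{l-1}^2|y_{l-1} < 0]\mathrm{Pr}[y_{l-1} < 0] \\
&= 2\mathbb{E}[y_{l-1}^2|y_{l-1} > 0]\mathrm{Pr}[y_{l-1} > 0] \tag{1}
\end{align}
The first line uses the zero mean, and the last line uses the symmetry (the event $y_{l-1} = 0$ contributes nothing to $\mathbb{E}[y_{l-1}^2]$).

Since $$x_l = \max(0,y_{l-1})$$
we have $$x_l^2 = \begin{cases} y_{l-1}^2 & , y_{l-1} > 0 \\ 0 & , y_{l-1} \leq 0\end{cases}$$
and therefore, by (1), $$\mathbb{E}[x_l^2]=\mathbb{E}[y_{l-1}^2|y_{l-1}> 0]\mathrm{Pr}[y_{l-1}>0]=\frac12 \mathrm{Var}(y_{l-1})$$
Substituting this for $E[x^2]$ in $\mathrm{Var}[y] = n \mathrm{Var}[w]E[x^2]$ gives
$$\mathrm{Var}[y] = \frac12 n \mathrm{Var}[w] \mathrm{Var}[y_{l-1}]$$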
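The identity $\mathbb{E}[x_l^2] = \frac12 \mathrm{Var}(y_{l-1})$ can also be checked numerically. The following is a minimal sketch that assumes $y_{l-1}$ is Gaussian with zero mean; any zero-mean symmetric distribution should behave the same way. The function name `relu_second_moment_check` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def relu_second_moment_check(num_samples=200_000, sigma=1.5, seed=0):
    """Compare E[max(0, y)^2] with Var(y) / 2 for y ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    # y plays the role of y_{l-1}: zero mean, symmetric around zero.
    ys = [rng.gauss(0.0, sigma) for _ in range(num_samples)]
    mean_y = sum(ys) / num_samples
    var_y = sum((y - mean_y) ** 2 for y in ys) / num_samples
    # x = max(0, y) plays the role of x_l, the ReLU output.
    second_moment_x = sum(max(0.0, y) ** 2 for y in ys) / num_samples
    return second_moment_x, var_y / 2


second_moment_x, half_var_y = relu_second_moment_check()
print(second_moment_x, half_var_y)
```

The two printed values should be close to each other, differing only by sampling noise.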