Understand the math behind He-Initialization


I tried to work through the math behind the initialization method that Kaiming He et al. propose in their paper.

Ignoring the bias, and assuming $y$ is a sum of $n$ i.i.d. terms $wx$, we get the variance of $y$ as:

$$\mathrm{Var}[y] = n \mathrm{Var}[wx] $$

To get the variance of $wx$, I use this formula, which holds when $w$ and $x$ are independent:

$$\mathrm{Var}[wx] = E[w^2]E[x^2] - (E[w])^2(E[x])^2$$

Let $w$ have zero mean, so I am left with:

$$\mathrm{Var}[wx] = E[w^2]E[x^2]$$

Since $E[w] = 0$, the variance formula gives $\mathrm{Var}[w] = E[w^2] - (E[w])^2 = E[w^2]$, so I can replace $E[w^2]$:

$$\mathrm{Var}[wx] = \mathrm{Var}[w]E[x^2]$$

Then I substitute this back into the first equation:

$$\mathrm{Var}[y] = n \mathrm{Var}[w]E[x^2] $$
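The equation above can be sanity-checked numerically. The following is a minimal Monte Carlo sketch in plain Python, not anything taken from the paper. The function name `simulate_layer_variance`, the Gaussian weights, and the rectified Gaussian inputs (chosen so that $x$ does not have zero mean) are all assumptions made for illustration.

```python
import random


def simulate_layer_variance(n=16, trials=20000, w_std=0.5, seed=0):
    """Compare the empirical Var[y] with n * Var[w] * E[x^2].

    Assumptions for this sketch: w ~ N(0, w_std^2), x = max(0, z) with
    z ~ N(0, 1), all draws independent, and y = sum of n products w * x.
    """
    rng = random.Random(seed)
    ys = []
    x_sq_sum = 0.0
    for _trial in range(trials):
        y = 0.0
        for _term in range(n):
            w = rng.gauss(0.0, w_std)          # zero-mean weight
            x = max(0.0, rng.gauss(0.0, 1.0))  # non-negative input, E[x] != 0
            y += w * x
            x_sq_sum += x * x
        ys.append(y)

    mean_y = sum(ys) / trials
    var_y = sum((y - mean_y) ** 2 for y in ys) / trials  # empirical Var[y]
    e_x2 = x_sq_sum / (trials * n)                       # empirical E[x^2]
    predicted = n * w_std ** 2 * e_x2                    # n Var[w] E[x^2]
    return var_y, predicted


var_y, predicted = simulate_layer_variance()
print(var_y, predicted)
```

The two printed numbers should agree up to sampling noise, even though $E[x] \neq 0$ here. Only $w$ needs zero mean for the cross term to vanish.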

From here, I am not sure how they get the following, where $y_{l-1}$ is the output of the previous layer before the ReLU:

$$\mathrm{Var}[y] = \frac12n \mathrm{Var}[w] \mathrm{Var}[y_{l-1}] $$

Can anyone explain this step to me? Thank you.

Best answer

In the paper, $w_{l-1}$ has a distribution symmetric around zero and the bias is zero, so $y_{l-1}$ has zero mean and a distribution symmetric around zero. Therefore:

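\begin{align}
\mathrm{Var}(y_{l-1}) &= \mathbb{E}[y_{l-1}^2] \\
&= \mathbb{E}[y_{l-1}^2 \mid y_{l-1} > 0]\,\mathrm{Pr}[y_{l-1} > 0] + \mathbb{E}[y_{l-1}^2 \mid y_{l-1} < 0]\,\mathrm{Pr}[y_{l-1} < 0] \\
&= 2\,\mathbb{E}[y_{l-1}^2 \mid y_{l-1} > 0]\,\mathrm{Pr}[y_{l-1} > 0] \tag{1}
\end{align}

The first equality uses the zero mean, and the last one uses the symmetry. The event $y_{l-1} = 0$ contributes nothing to $\mathbb{E}[y_{l-1}^2]$.

Since the ReLU gives $$x_l = \max(0,y_{l-1}),$$

we have $$x_l^2 = \begin{cases} y_{l-1}^2 & , y_{l-1} > 0 \\ 0 & , y_{l-1} \leq 0\end{cases}$$

and therefore, by $(1)$, $$\mathbb{E}[x_l^2]=\mathbb{E}[y_{l-1}^2 \mid y_{l-1}> 0]\,\mathrm{Pr}[y_{l-1}>0]=\frac12 \mathrm{Var}(y_{l-1}).$$

Substituting this into $\mathrm{Var}[y] = n \mathrm{Var}[w]E[x^2]$ from the question gives $\mathrm{Var}[y] = \frac12 n \mathrm{Var}[w] \mathrm{Var}[y_{l-1}]$.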
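The identity $\mathbb{E}[x_l^2] = \frac12 \mathrm{Var}(y_{l-1})$ can also be checked numerically. Below is a minimal sketch in plain Python. The helper name `relu_second_moment_ratio` and the two test distributions (a Gaussian and a uniform, both symmetric around zero) are assumptions chosen for illustration.

```python
import random


def relu_second_moment_ratio(sampler, trials=200000, seed=0):
    """Estimate E[max(0, y)^2] / Var(y) for samples y drawn by `sampler`.

    `sampler` is a hypothetical callback taking a random.Random instance
    and returning one sample of y.
    """
    rng = random.Random(seed)
    ys = [sampler(rng) for _ in range(trials)]
    mean_y = sum(ys) / trials
    var_y = sum((y - mean_y) ** 2 for y in ys) / trials  # empirical Var(y)
    e_x2 = sum(max(0.0, y) ** 2 for y in ys) / trials    # empirical E[x^2], x = ReLU(y)
    return e_x2 / var_y


# Two zero-mean distributions that are symmetric around zero.
gauss_ratio = relu_second_moment_ratio(lambda rng: rng.gauss(0.0, 2.0))
uniform_ratio = relu_second_moment_ratio(lambda rng: rng.uniform(-3.0, 3.0))
print(gauss_ratio, uniform_ratio)
```

Both ratios should come out close to $\frac12$, regardless of the scale of $y$. This factor of $\frac12$ is what leads to the choice $\mathrm{Var}[w] = 2/n$: requiring $\frac12 n \mathrm{Var}[w] = 1$ keeps the variance of $y$ the same from one layer to the next.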