A confusion on the variance of weight matrix of Kaiming He's initialization


When training neural networks with ReLU activations, Kaiming He's initialization is a good way to initialize the parameters. In the paper "Delving Deep into Rectifiers", the authors give a detailed derivation, but I am confused by one step.

Suppose ${\bf y}_l = {\bf W}_l{\bf x}_l$, where ${\bf x}_l,{\bf y}_l$ are vectors and ${\bf W}_l$ is a matrix. The authors treat not only ${\bf x}_l$ but also ${\bf W}_l$ as random, with i.i.d. elements, and derive the following equation:

$$Var[y_l]=n_lVar[w_l]E[x^2_l]$$

where $y_l,w_l,x_l$ denote elements of the corresponding vectors and matrix. But I think that when computing the variance of $y_l$, we should condition on a fixed ${\bf W}_l$ rather than treating ${\bf W}_l$ as random; that is, we should focus on the probability distribution of the input alone.
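To make sure I am reading the authors' equation correctly, here is a quick Monte Carlo check (the sizes and the $Var[w_l]=0.01$ scale are arbitrary choices of mine). It draws a fresh weight row for every trial, with inputs that have a nonzero mean as they would after a ReLU, and compares the empirical $Var[y_l]$ against $n_lVar[w_l]E[x_l^2]$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256          # fan-in n_l (arbitrary)
trials = 200_000

# ReLU-like inputs with E[x] != 0, so E[x^2] differs from Var[x]
x = np.abs(rng.normal(size=(trials, n)))
# a fresh weight row per trial, zero mean, Var[w] = 0.01
w = rng.normal(scale=0.1, size=(trials, n))

y = (w * x).sum(axis=1)           # one output coordinate y_l per trial

print(y.var())                    # empirical Var[y_l]
print(n * 0.01 * (x**2).mean())   # n_l * Var[w_l] * E[x_l^2]
```

The two printed values agree, so with ${\bf W}_l$ random the $E[x_l^2]$ version is indeed what comes out.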

In this case, suppose ${\bf W}_l$ is fixed with zero mean on each row, and let $Var[w_l]$ denote the variance of the elements of a row (assumed to be the same for all rows). Assuming the components of ${\bf x}_l$ are independent, we arrive at a different equation: $$Var[y_l]=Var\Big[\sum_i w_{l,i}x_{l,i}\Big]=\sum_i w^2_{l,i}Var[x_{l,i}]=n_lVar[w_l]Var[x_l]$$
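The conditional version can be checked the same way (again with my own arbitrary sizes and scale): fix a single zero-mean weight row, vary only the input, and compare $Var[y_l\,|\,{\bf W}_l]$ against $\sum_i w_{l,i}^2 Var[x_{l,i}]$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
trials = 200_000

# one fixed weight row, centered so it has zero mean as assumed above
w = rng.normal(scale=0.1, size=n)
w -= w.mean()

# ReLU-like inputs with E[x] != 0, independent components
x = np.abs(rng.normal(size=(trials, n)))
y = x @ w

print(y.var())                 # empirical Var[y_l | W_l fixed]
print((w**2).sum() * x.var())  # sum_i w_i^2 * Var[x] = n_l Var[w_l] Var[x_l]
```

Here the match involves $Var[x_l]$, not $E[x_l^2]$, which is exactly the discrepancy I am asking about.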

Isn't this more reasonable? But $Var[x_l]$ can be much smaller than $E[x^2_l]$, which means the condition $$\frac 1 2 n_lVar[w_l] = 1$$ should be modified to $$\frac 1 c n_lVar[w_l] = 1$$ where $c>2$. For instance, if the pre-activations are Gaussian (so that $x_l$ is the ReLU of a Gaussian variable), $c$ approximately equals $3$.
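The $c\approx 3$ figure comes from taking $x_l=\max(0,z)$ with $z$ a zero-mean Gaussian: then $E[x_l^2]=\frac 1 2 Var[z]$ (He's factor of $1/2$), while $Var[x_l]=\left(\frac 1 2-\frac 1{2\pi}\right)Var[z]$, giving $c=\frac{2\pi}{\pi-1}\approx 2.93$. A quick numerical confirmation:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=1_000_000)   # Gaussian pre-activation y_{l-1}
x = np.maximum(z, 0.0)           # x_l = ReLU(y_{l-1})

# He's factor: E[x^2] / Var[z] = 1/2
print((x**2).mean() / z.var())
# modified factor: 1/c = Var[x]/Var[z] = 1/2 - 1/(2*pi), so c ~= 2.93
print(z.var() / x.var())
```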

By the way, for the other common scheme, Xavier initialization, I find that the derivation remains correct even when ${\bf W}_l$ is held fixed.