Derivation of the gradient in "Fast dropout training" paper


I'm reading the "Fast dropout training" paper and am having difficulty understanding it. In formula (5) on page 3, the authors derive an approximation to $\frac{\partial L(w)}{\partial w_i}$. I can't understand how they get the penultimate step in this derivation.

By definition we have:
$$ \frac{\partial L(w)}{\partial w_i} = \underset{z}{\mathbb{E}}[f(Y(z))\, x_i z_i] $$
Since all the $z_i$-s are independent, we can condition on $z_i$ and sum over its two values:
$$ = \sum_{z_i \in \{0,1\}} p(z_i)\, z_i x_i \underset{z_{-i} \mid z_i}{\mathbb{E}}[f(Y(z))] $$
The term with $z_i = 0$ vanishes, so we are left with:
$$ = p_i x_i \underset{z_{-i} \mid z_i = 1}{\mathbb{E}}[f(Y(z))] $$
The authors then use a "linear approximation to the conditional expectation". What is it? This is the step I do not understand:
$$ \approx p_i x_i \Big(\mathbb{E}_S[f(S)] + \Delta \mu_i \frac{\partial\, \mathbb{E}_S[f(S)]}{\partial \mu} \Big|_{\mu=\mu_S} + \Delta \sigma^2_i \frac{\partial\, \mathbb{E}_S[f(S)]}{\partial \sigma^2} \Big|_{\sigma^2=\sigma^2_S} \Big) $$
$Y(z)$ is approximately a normally distributed r.v. and $S$ is its Gaussian approximation, so why aren't we fine with just substituting one into the other? Why do we use such an approximation, and what are the intermediate steps to derive it? This does not look like a Taylor approximation, because we use another random variable $S$ to approximate $Y(z)$, and the two do not generally match in their first two moments.
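For what it's worth, the exact steps before the approximation (pulling $z_i$ out and conditioning on $z_i = 1$) can be checked numerically. Below is a minimal Monte Carlo sketch, assuming $Y(z) = \sum_j w_j x_j z_j$ with independent $z_j \sim \mathrm{Bernoulli}(p_j)$ and an arbitrary smooth $f$ (here $\tanh$); the variable names are illustrative, not from the paper's code.

```python
import numpy as np

# Monte Carlo check of the exact identity used above:
#   E_z[ f(Y(z)) * x_i * z_i ] = p_i * x_i * E_{z_{-i} | z_i = 1}[ f(Y(z)) ]
rng = np.random.default_rng(0)

d = 5
w = rng.normal(size=d)        # weights w_j
x = rng.normal(size=d)        # one input example x_j
p = np.full(d, 0.5)           # dropout keep probabilities p_j
f = np.tanh                   # any smooth scalar function (stand-in for f)
i = 2                         # coordinate of interest

n = 200_000
z = (rng.random((n, d)) < p).astype(float)  # independent z_j ~ Bernoulli(p_j)
Y = z @ (w * x)                             # Y(z) = sum_j w_j x_j z_j

# Left-hand side: plain expectation over all z
lhs = np.mean(f(Y) * x[i] * z[:, i])

# Right-hand side: condition on z_i = 1 (z_{-i} unchanged), scale by p_i x_i
z1 = z.copy()
z1[:, i] = 1.0
rhs = p[i] * x[i] * np.mean(f(z1 @ (w * x)))

assert abs(lhs - rhs) < 1e-2  # agree up to Monte Carlo error
```

This only verifies the two exact steps; the part I'm asking about is the final "linear approximation" line, which replaces the conditional expectation with derivatives of $\mathbb{E}_S[f(S)]$.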