While reading the paper *Deterministic Variational Inference for Robust Bayesian Neural Networks*, I came across a part that confuses me.
In contrast to the standard neural network convention, the authors apply the activation function first, followed by the linear combination, as follows:
$$(1) \quad \quad h^{(l)}=f(a^{(l-1)})$$ $$(2) \quad \quad a^{(l)} = h^{(l)}W^{(l)}+b^{(l)}$$
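To check that I'm reading equations (1) and (2) correctly, here is a small NumPy sketch of one layer in this ordering (the dimensions and the choice of ReLU for $f$ are mine for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes, chosen only for illustration.
n_in, n_out = 128, 64
W = rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)  # weights W^{(l)}
b = np.zeros(n_out)                                  # bias b^{(l)}

def layer(a_prev):
    """One layer in the paper's ordering:
    h^{(l)} = f(a^{(l-1)})           -- activation first
    a^{(l)} = h^{(l)} W^{(l)} + b^{(l)} -- then the linear combination
    """
    h = np.maximum(a_prev, 0.0)  # f = ReLU, as an example nonlinearity
    return h @ W + b

a_prev = rng.normal(size=n_in)  # pre-activations from layer l-1
a = layer(a_prev)
print(a.shape)                  # prints (64,)
```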
The authors then claim that, because $a^{(l)}$ is a linear combination of many elements of $h^{(l)}$, it converges to a normal distribution by the central limit theorem (CLT):
> $a^{(l)}$ is constructed by linear combination of many distinct elements of $h^{(l)}$, and in the limit of vanishing correlation between terms in this combination, we can appeal to the central limit theorem (CLT). Under the CLT, for a large enough hidden dimension and for variational distributions with finite first and second moments, elements $a_i$ will be normally distributed regardless of the potentially complicated distribution for $h_j$ induced by $f$.
I don't understand why the CLT can be applied to a linear combination of activations in this case. The classical CLT assumes i.i.d. summands, but the terms $h_j W_{ji}$ here are neither identically distributed nor obviously independent. Could anyone elaborate on this?
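To make my confusion concrete: empirically I can reproduce the claimed effect. In this sketch (my own, not from the paper) I take $h_j = \mathrm{ReLU}(z_j)$ with standard normal $z_j$, which is clearly non-Gaussian, multiply by independent Gaussian weights, and sum. The excess kurtosis of the standardized sum is far from the Gaussian value 0 for a small hidden dimension but close to 0 for a large one, so the phenomenon holds; what I am missing is the justification.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pre_activation(hidden_dim, n_samples=20000):
    """Draw samples of a_i = sum_j h_j W_{ji} for one output unit i.
    h_j = ReLU(z_j) is non-Gaussian; W_{ji} are independent Gaussians."""
    h = np.maximum(rng.normal(size=(n_samples, hidden_dim)), 0.0)
    w = rng.normal(size=hidden_dim) / np.sqrt(hidden_dim)  # one column of W
    return h @ w

excess_kurtosis = {}
for d in (2, 1000):
    a = sample_pre_activation(d)
    a = (a - a.mean()) / a.std()          # standardize
    excess_kurtosis[d] = np.mean(a**4) - 3.0  # 0 for an exact Gaussian
    print(d, round(excess_kurtosis[d], 2))
```

Running this, the hidden dimension 2 case has large positive excess kurtosis (heavy tails), while the dimension 1000 case is nearly 0, i.e. close to Gaussian.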