A batch normalization (BN) layer is normally used to reduce the internal covariate shift problem in neural networks. In such a layer, an input $x$ is normalized as $x^{\prime} = \frac{x-\mu(x)}{\sqrt{\mathrm{Var}(x)}}$ (in practice a small $\epsilon$ is added inside the square root for numerical stability). At inference time, the mean and variance are running estimates accumulated from the statistics of the training inputs.
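To make the running-statistics mechanism concrete, here is a minimal NumPy sketch of a training-mode BN forward pass. The function name and the `momentum`/`eps` defaults are my illustrative choices (matching common framework conventions), not part of the question:

```python
import numpy as np

def batchnorm_forward(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Training-mode BN over a (batch, features) array x.

    Normalizes with the current batch statistics and returns updated
    running estimates via an exponential moving average.
    """
    mu = x.mean(axis=0)          # per-feature batch mean
    var = x.var(axis=0)          # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Exponential moving average of the statistics, as frameworks do.
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return x_hat, running_mean, running_var
```

Repeated calls on tightly clustered inputs drive `running_var` toward the (small) per-batch variance, and that stored value is exactly the quantity being compared between $F_1$ and $F_2$ in the question.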
What I want to ask is this: suppose we have two datasets $D_1, D_2$, and two models learned from them respectively, $F_1, F_2$.
Let's first define $F_1, F_2$ as simple MLPs of the form $W_2(BN(W_1(x)))$, where $W_1, W_2$ are two linear maps (I omit the bias and activation terms for simplicity).
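Under these definitions, the inference-time computation of each $F_i$ can be sketched as follows (the helper name and the shapes in the example are my assumptions, chosen only for illustration):

```python
import numpy as np

def mlp_forward(x, W1, W2, running_mean, running_var, eps=1e-5):
    """Inference-mode W2(BN(W1(x))).

    Biases, the activation, and BN's learnable scale/shift are omitted,
    matching the simplified setup in the question.
    """
    h = x @ W1                                                 # first linear map
    h_norm = (h - running_mean) / np.sqrt(running_var + eps)   # BN with running stats
    return h_norm @ W2                                         # second linear map
```

Note that at inference time BN is just a fixed affine rescaling by the stored statistics, so a smaller `running_var` means the normalization applies a larger per-feature gain.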
What kind of conclusion can we possibly draw if $F_1$'s BN layer has a much smaller learned running variance than $F_2$'s?
My intuition is that if the BN layer has a smaller running variance, it may indicate that the preceding layer handles all of its inputs consistently, encoding them into a steadier, more compact space.
Does that mean $F_1$ is actually more invariant to $D_1$ than $F_2$ is to $D_2$?
If someone has an answer or wants to discuss it, please drop a comment below, thank you.
Also, my problem setup might not be complete; I'm happy to add more definitions/assumptions to it.