A batch normalization (BN) layer is normally used to reduce the internal covariate shift problem in neural networks. In such a layer, an input $x$ is normalized as $x^{\prime} = \frac{x-\mu(x)}{\sqrt{\mathrm{Var}(x)}}$ (in practice a small $\epsilon$ is added inside the square root for numerical stability). At inference time, the mean and variance are running estimates accumulated from the statistics of the training inputs.
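To make the running-statistics mechanism concrete, here is a minimal NumPy sketch of a training-mode BN forward pass. The function name and the `momentum`/`eps` defaults are my illustrative choices (matching common framework conventions), not part of the question:

```python
import numpy as np

def batchnorm_forward(x, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Training-mode BN over a (batch, features) array x.

    Normalizes with the current batch statistics and returns updated
    running estimates via an exponential moving average.
    """
    mu = x.mean(axis=0)          # per-feature batch mean
    var = x.var(axis=0)          # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Exponential moving average of the statistics, as frameworks do.
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return x_hat, running_mean, running_var
```

Repeated calls on tightly clustered inputs drive `running_var` toward the (small) per-batch variance, and that stored value is exactly the quantity being compared between $F_1$ and $F_2$ in the question.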
What I want to ask is this: suppose we have two datasets $D_1, D_2$, and two models learned from them respectively, $F_1, F_2$.
Let's first define $F_1, F_2$ as simple MLPs of the form $W_2(BN(W_1(x)))$, where $W_1, W_2$ are two linear maps (I omit the bias and activation terms for simplicity).
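Under these definitions, the inference-time computation of each $F_i$ can be sketched as follows (the helper name and the shapes in the example are my assumptions, chosen only for illustration):

```python
import numpy as np

def mlp_forward(x, W1, W2, running_mean, running_var, eps=1e-5):
    """Inference-mode W2(BN(W1(x))).

    Biases, the activation, and BN's learnable scale/shift are omitted,
    matching the simplified setup in the question.
    """
    h = x @ W1                                                 # first linear map
    h_norm = (h - running_mean) / np.sqrt(running_var + eps)   # BN with running stats
    return h_norm @ W2                                         # second linear map
```

Note that at inference time BN is just a fixed affine rescaling by the stored statistics, so a smaller `running_var` means the normalization applies a larger per-feature gain.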
What kind of conclusion can we possibly draw if $F_1$'s BN layer has a much smaller learned running variance than $F_2$'s?
My intuition is that if the BN layer has a smaller running variance, it may indicate that the preceding layer handles all of its inputs consistently, encoding them into a steadier, more compact space.
Does that mean $F_1$ is actually more invariant to $D_1$ than $F_2$ is to $D_2$?
If someone has an answer or wants to discuss it, please drop a comment below, thank you.
Also, my problem setup might not be complete; I'm happy to add more definitions/assumptions to it.