Why is it advantageous for the inputs/targets of most ML algorithms, like neural nets, to be normally distributed? I am not talking about mean normalization, but about cases of skewed data where people perform a log transform to make the distribution more normal. Example: https://medium.com/ml-byte/rare-feature-engineering-techniques-for-machine-learning-competitions-de36c7bb418f — the last tip there says to transform targets with log(1+target) for exactly this reason. Why? Why does it become easier to fit, or why does accuracy improve?
P.S. I know from studying linear models that such models describe how variance in the input affects variance in the output. So the mean and variance that we normalize by are just shift and scale factors acting on the true relationship between the variables. But I'm not sure about log transforms and the like, and why fitting becomes easier there.
Logs are monotonic transforms, so they don't actually change the locations of optima. However, they can lead to far more numerically stable algorithms, especially when exponentials of large numbers or products of many small probabilities are involved. From this perspective, the resulting "Gaussianity" is incidental.
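The numerical-stability point is easy to see with a toy example (hypothetical numbers, just for illustration): multiplying many small probabilities underflows in floating point, while summing their logs does not.

```python
import numpy as np

# Hypothetical likelihood: 1000 independent events, each with probability 0.01.
probs = np.full(1000, 0.01)

# The direct product is 0.01**1000 = 1e-2000, far below the smallest
# representable float64 (~1e-308), so it underflows to exactly 0.0.
direct = np.prod(probs)
print(direct)  # 0.0

# The log-likelihood stays perfectly representable: 1000 * log(0.01).
log_prob = np.sum(np.log(probs))
print(log_prob)  # about -4605.17
```

Since log is monotonic, any parameters that maximize the log-likelihood also maximize the likelihood itself, so nothing is lost by working in log space.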
Neural networks prefer to keep things on the same scale. When taking the log of targets (or features) with long-tailed distributions, one is transforming an "exponentially spread out" variable into one that is "linearly spread out", meaning we go from spanning many orders of magnitude to just a couple. But why does the ANN care? Because deep networks are (roughly) just stacks of linear transforms with (usually) a simple non-linear rectifier like ReLU between them. This means the model has to cope with numbers that can be tiny or gargantuan mostly via linear transforms (the most ReLU can contribute is zeroing out negative values)! Remember that these same weights have to handle data points spanning many orders of magnitude; a huge input value is overwhelming for both the forward and backward passes.
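A quick sketch of this scale compression (with made-up long-tailed values): `log1p` collapses a span of several orders of magnitude into roughly one.

```python
import numpy as np

# Hypothetical long-tailed targets spanning ~6 orders of magnitude.
targets = np.array([3.0, 40.0, 1_500.0, 80_000.0, 2_000_000.0])

# log(1 + target): well-defined even when a target is exactly 0.
scaled = np.log1p(targets)

print(targets.max() / targets.min())  # raw span: roughly 6.7e5
print(scaled.max() / scaled.min())    # log span: roughly 10

# expm1 inverts the transform, so predictions can be mapped back.
recovered = np.expm1(scaled)
```

Every weight in the first layer now sees values within about one order of magnitude of each other, rather than having to serve both the 3s and the 2,000,000s.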
From a pure data/generalization perspective, the problem with long tails can also be exacerbated when those tails are sparsely populated (as is usually the case). For targets, one way to view this is as a class-imbalance problem in the dataset. By taking the log, we crush down those vast swathes of distance in the tails of the distribution and avoid this. For features, it is more that the data space is artificially empty in the tails. Suppose (as in the target distribution for the linked example) we have a feature with lots of data in 1M–5M, very little in the tails, yet small spikes at 30M and 40M. Inputs with values between 30M and 40M are then quite possible, but the network may never have seen anything in that range due to the sparseness of the tails. By taking the log, we avoid having barren parts of the data space that the network must interpolate and/or extrapolate into (since the log places 30M and 40M very close together). If all the recent work on adversarial examples has taught us anything, it is that networks cannot be trusted even a small distance into uncharted territory!
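Using the numbers from the example above, one can check how the log reshapes the gaps: in raw units the 30M–40M gap dwarfs the dense 1M–5M region, but in log units it becomes the smaller of the two.

```python
import numpy as np

# Gap across the dense region (1M to 5M) vs. the sparse tail spikes (30M to 40M).
bulk_gap = np.log(5e6) - np.log(1e6)  # = log(5)   ~ 1.61
tail_gap = np.log(4e7) - np.log(3e7)  # = log(4/3) ~ 0.29

# Raw gaps: 4M across the bulk, 10M across the tail spikes.
# After the log, the once-enormous tail gap is smaller than the bulk gap,
# so the network no longer faces a huge barren stretch of input space.
print(bulk_gap, tail_gap)
```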
From a theory perspective, Gaussianity is always appealing. This can also be practical: for example, variational Bayesian neural networks are often approximated with mean-field methods, meaning the weights are given Gaussian posterior representations, which seems to work well in practice. More theoretically, Bayesian neural networks (of which regular ANNs are a special case) are closely related to Gaussian processes. Separately, assuming a Gaussian-distributed error model, one obtains the maximum likelihood estimate by minimizing the $L_2$ regression error. In general, in virtually all applications, one prefers to maximize a log-likelihood (equivalently, minimize a negative log-likelihood) rather than work with the likelihood directly. Altogether, I don't think the reason is primarily theoretically motivated; however, there are some theoretical interpretations for it.
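The Gaussian-error/$L_2$ correspondence mentioned above is a standard derivation worth writing out. Assuming i.i.d. noise $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the negative log-likelihood is

$$-\log \prod_{i=1}^{n} \mathcal{N}\!\left(y_i \mid f(x_i), \sigma^2\right) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2 + \frac{n}{2}\log\left(2\pi\sigma^2\right),$$

so for fixed $\sigma$, maximizing the likelihood over $f$ is exactly minimizing the sum of squared errors.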