I took a look at this paper, and it states that "a standard multilayer feedforward network with a locally bounded piecewise continuous activation function can approximate any continuous function to any degree of accuracy if and only if the network's activation function is not a polynomial."
Okay, great. So now I know that, generally speaking, any single appropriate non-polynomial function used as the activation functions of a standard multilayered feedforward network is a universal approximator.
But in this scenario (and all the others I have come across), all the activation functions of the network are the same.
My question is: how can I be certain that universal approximation holds for a neural network that is an ensemble of two other networks, each with a different activation function (where each function satisfies the definition above)? Imagine using sigmoid in one network and tanh in the other, such that each network on its own would meet the above criteria.
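To make the setup concrete, here is a minimal sketch of the kind of ensemble I mean: two independent single-hidden-layer networks, one with sigmoid and one with tanh hidden units, whose outputs are averaged. The weights, layer sizes, and the averaging rule are all hypothetical placeholders, not taken from any particular paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(x, W1, b1, W2, b2, act):
    """One-hidden-layer feedforward network: act(x @ W1 + b1) @ W2 + b2."""
    return act(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)

# Network A: sigmoid hidden layer (weights are arbitrary for illustration).
W1a, b1a = rng.normal(size=(1, 8)), rng.normal(size=8)
W2a, b2a = rng.normal(size=(8, 1)), rng.normal(size=1)

# Network B: tanh hidden layer.
W1b, b1b = rng.normal(size=(1, 8)), rng.normal(size=8)
W2b, b2b = rng.normal(size=(8, 1)), rng.normal(size=1)

def ensemble(x):
    # Each sub-network alone is a standard feedforward net with a
    # non-polynomial activation; the ensemble averages their outputs.
    ya = mlp(x, W1a, b1a, W2a, b2a, sigmoid)
    yb = mlp(x, W1b, b1b, W2b, b2b, np.tanh)
    return 0.5 * (ya + yb)

x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(ensemble(x).shape)  # (5, 1)
```

The question is whether this combined map, as a whole, still inherits the universal approximation property from its two parts.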
Clearly universal approximation seems to apply in practice, since people build ensembles all the time, but how can we be certain it always holds?