The universal approximation theorem states that "the standard multilayer feed-forward network with a single hidden layer, which contains a finite number of hidden neurons, is a universal approximator among continuous functions on compact subsets of $R^n$, under mild assumptions on the activation function."
I understand what this means, but the relevant papers are too far over my level of math understanding to grasp why it is true or how a hidden layer approximates non-linear functions.
So, in terms only slightly more advanced than basic calculus and linear algebra, how does a feed-forward network with one hidden layer approximate non-linear functions? The answer need not be totally concrete.
I also posted this question at TCS and CV, where it initially went unanswered. It has since received a really excellent and comprehensive answer.
If the hidden units are radial basis functions (i.e., they have a peak response when the input pattern is close in a Euclidean distance sense to the parameter vector of the hidden unit), then each hidden unit basically generates a "bump". A superposition of such "bumps" can then be used to approximate an arbitrary function. Other types of hidden units such as "sigmoidal" units also have this type of "bump" response property.
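A minimal sketch of this idea, assuming Gaussian radial basis "bumps" and a least-squares fit for the output weights (the target function, centers, and bump width below are illustrative choices, not part of the answer; a trained network would find the output weights by gradient descent instead):

```python
import numpy as np

# Target non-linear function to approximate on [0, 1].
def f(x):
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 200)
centers = np.linspace(0, 1, 20)  # one "bump" per hidden unit
sigma = 0.1                      # bump width (illustrative hyperparameter)

# Hidden-layer activations: each column is one unit's bump response,
# peaking when the input is close to that unit's center.
H = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

# Output weights: coefficients of the superposition of bumps,
# found here by solving a linear least-squares problem.
w, *_ = np.linalg.lstsq(H, f(x), rcond=None)

approx = H @ w
max_err = np.max(np.abs(approx - f(x)))
print(max_err)  # small: the weighted bumps reconstruct the curve closely
```

With only 20 bumps the superposition already tracks the sine curve closely; adding more hidden units (more, narrower bumps) drives the error down further, which is the intuition behind the theorem.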