Universal Approximation with Fixed Width


Fix an activation function $\sigma$, and denote by $NN^{\sigma}$ the class of all neural networks from $\mathbb{R}$ to $\mathbb{R}$ built from this activation function.

The classical universal approximation theorem for neural networks on $C([0,1];\mathbb{R})$ states that for a fixed nice activation function $\sigma$, any $f\in C([0,1];\mathbb{R})$ and any $\epsilon >0$, there exists some $f_{\epsilon} \in NN^{\sigma}$ satisfying \begin{equation} \sup_{x \in [0,1]} |f_{\epsilon}(x)-f(x)|\leq \epsilon. \label{eq_1} \end{equation} Moreover, a single hidden layer suffices.
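As a minimal numerical sketch of the one-hidden-layer statement (my own construction, not part of the theorem): place ReLU hinges on a grid in $[0,1]$ and solve only for the output weights by least squares. The target $\sin(2\pi x)$ and all widths here are illustrative choices.

```python
import numpy as np

# Sketch: one-hidden-layer ReLU network approximating f(x) = sin(2*pi*x) on [0, 1].
# Hidden weights are fixed (hinges on a grid); only the output layer is solved
# by least squares. The sup error shrinks as the width grows.
f = lambda x: np.sin(2 * np.pi * x)

x = np.linspace(0.0, 1.0, 400)
width = 50
b = np.linspace(0.0, 1.0, width)                   # hinge locations in [0, 1]
phi = np.maximum(x[:, None] - b[None, :], 0.0)     # ReLU features relu(x - b_j)
phi = np.hstack([phi, np.ones((x.size, 1))])       # affine (bias) column

c, *_ = np.linalg.lstsq(phi, f(x), rcond=None)     # output-layer weights
err = np.max(np.abs(phi @ c - f(x)))               # sup-norm error on the grid
print(f"sup error with width {width}: {err:.4f}")
```

Since a one-hidden-layer ReLU network is exactly a piecewise-linear function with breakpoints at the hinges, the error here is governed by the knot spacing, consistent with \eqref{eq_1}.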

My question is: if we instead consider all neural networks of arbitrary depth but fixed width, can we still obtain universal approximation using only this subset?


Let's assume $NN^{\sigma}$ is the set of arbitrarily deep feedforward networks with the ReLU activation function, all of whose layers have constant width $H$.

1) If $H>1$ (more than one neuron in each layer), then the universal approximation property indeed holds: $\forall f\in C([0,1];\mathbb{R})~~ \forall \varepsilon > 0 ~~ \exists f_{\varepsilon} \in NN^{\sigma}~~ \sup_{x \in [0,1]} |f_{\varepsilon}(x)-f(x)|\leq \varepsilon$.
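To see that width $2$ is already expressive, here is a hand-built deep width-2 ReLU network (an illustration of mine, not the construction from the cited result) that computes the tent map composed with itself three times; each extra pair of width-2 layers doubles the number of oscillations.

```python
import numpy as np

x = np.linspace(0, 1, 1001)

# Each hidden layer holds the pair [relu(t), relu(t - 1/2)], where
# t = 2*h[0] - 4*h[1] is the tent map applied to the previous value.
W = np.array([[2.0, 2.0], [-4.0, -4.0]])   # width-2 weight matrix
b = np.array([0.0, -0.5])                  # biases for the two neurons

h = np.stack([np.maximum(x, 0), np.maximum(x - 0.5, 0)], axis=1)  # first layer
for _ in range(2):                          # two more width-2 ReLU layers
    h = np.maximum(h @ W + b, 0.0)
out = h @ np.array([2.0, -4.0])             # linear output layer: tent composed 3x

# tent∘tent∘tent is piecewise linear with 4 peaks (at x = 1/8, 3/8, 5/8, 7/8)
```

No width-1 network can produce such oscillation, which is exactly where case 2) below comes in.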

2) If $H=1$ (only 1 neuron in each layer), then there is a function in $C([0, 1]; \mathbb{R})$ which can't be approximated in this way. So the universal approximation property does not hold.
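The obstruction in the $H=1$ case is easy to check numerically: each width-1 ReLU layer $x\mapsto\mathrm{relu}(wx+b)$ is monotone, and a composition of monotone maps is monotone, so the whole network computes a monotone function and cannot approximate, say, the tent function. A small sketch (my own, with a hypothetical helper name) verifying this for random deep width-1 networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def width1_relu_net(x, depth):
    """Evaluate a random ReLU network with one neuron per hidden layer."""
    h = x.copy()
    for _ in range(depth):
        w, b = rng.normal(size=2)
        h = np.maximum(w * h + b, 0.0)   # each layer is a monotone map of h
    a, c = rng.normal(size=2)
    return a * h + c                     # affine output: still monotone

x = np.linspace(0.0, 1.0, 500)           # sorted inputs
for trial in range(100):
    y = width1_relu_net(x, depth=int(rng.integers(1, 20)))
    d = np.diff(y)
    # along sorted inputs the output never changes direction
    assert np.all(d >= -1e-12) or np.all(d <= 1e-12)
print("all width-1 networks were monotone")
```

Any non-monotone target, e.g. $f(x)=|x-1/2|$, therefore stays at a fixed positive distance from every such network in the sup norm.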

This is a special case of a result obtained by Hanin and Sellke in 2017. In general, it states that the universal approximation property holds if and only if the layer width $H$ is strictly larger than the input dimension $\nu$.

As of 2019, a number of related open problems remain (source: lecture notes from the deep learning theory course by D. Yarotsky):

1) Even though for $H=\nu$ the approximation property is no longer universal (it does not hold for all functions in $C([0,1];\mathbb{R})$), what are the necessary and sufficient conditions for a function to be approximable by such a network?

2) What is the minimal width for other activation functions?