Approximating Lipschitz Functions by Sigmoidal Functions


We define a sigmoidal function $\sigma: \mathbb{R} \to [0,1]$ as a non-decreasing, continuous function with $\lim_{x\to\infty}\sigma(x) = 1$ and $\lim_{x\to-\infty}\sigma(x) = 0$. I want to show that:

If $g:\mathbb{R}\to \mathbb{R}$ is Lipschitz on the interval $[a,b]$, and $\epsilon>0$, then I can find $f\in \left\{\sum_{i=1}^{n}a_i \sigma(u_ix + b_i): a_i,u_i,b_i\in\mathbb{R}, n\geq 0\right\}$ such that $$\sup_{x\in[a,b]} \left\lvert f(x) - g(x) \right\rvert < \epsilon$$

In other words, I want to show that I can approximate $g$ with linear combinations of affinely shifted and scaled versions of $\sigma$. I know that for large enough $M>0$, $\sigma(Mx)$ acts like an indicator function (and I have proved that step functions can approximate Lipschitz functions), but I don't know how to go any further.
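The observation that $\sigma(Mx)$ acts like an indicator can be checked numerically. A minimal sketch, assuming the logistic function $\sigma(x) = 1/(1+e^{-x})$ as one valid choice of sigmoidal function (the claim holds for any $\sigma$ satisfying the definition above):

```python
import numpy as np

def sigma(x):
    """Logistic sigmoid: one concrete sigmoidal function."""
    return 1.0 / (1.0 + np.exp(-x))

# As M grows, sigma(M*x) converges pointwise (for x != 0) to the
# indicator of (0, infinity).
M = 1e3
xs = np.array([-0.5, -0.1, 0.1, 0.5])
indicator = (xs > 0).astype(float)

# Away from x = 0, sigma(M*x) is exponentially close to the indicator.
gap = np.max(np.abs(sigma(M * xs) - indicator))
```

Here the deviation `gap` is of order $e^{-100}$, i.e. numerically indistinguishable from zero away from the jump at $x=0$.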

BEST ANSWER

Let's do a sketch of proof using the sigmoid activation function $$\sigma\,\colon x \mapsto \frac{1}{1+\mathrm{e}^{-x}}.$$ Other activations will also work, for instance the popular "ReLU" activation. We intend to prove the following result (at least a sketch of proof is given):

Suppose that $\phi\, \colon A \subset \mathbb{R}^d \to \mathbb{R}$ is continuous, with $A$ compact. Then for any $\epsilon > 0$, there exists a two-layer neural network $f$ such that $$\max_{x\in A} |\phi(x) - f(x)| < \epsilon.$$

Sketch of Proof: We only consider the case $d=1$. The key is to show that we can approximate $\phi$ by a combination of step functions: $$\phi(x) \approx \sum_{j=1}^n h_j\,\mathbf{1}_{[a_j,b_j]}(x).$$ Indeed, to approximate a single step function $h\,\mathbf{1}_{[a,b]}(x)$, we form $$f(x) = h\,\sigma(\lambda\,(x-a)) - h\,\sigma(\lambda\,(x-b))$$ for $\lambda \gg 1$. The intuition behind this crucial construction is that $\sigma(x)$ can be seen as a smoothed version of a step function (by the way, $\sigma$, like the true step function, mimics the firing behavior of a neuron in the brain). Basically, the factor $\lambda \gg 1$ sharpens the changeover, and the shifts $-a$ and $-b$ set its location.

Therefore, to approximate $$\phi(x) \approx \sum_{j=1}^n h_j\,\mathbf{1}_{[a_j,b_j]}(x),$$ we construct $$f(x) = \sum_{j=1}^n \left[ h_j\,\sigma(\lambda\,(x-a_j)) - h_j\,\sigma(\lambda\,(x-b_j)) \right].$$ Notice that each step function costs $2$ neurons. The only crime we have committed here is that this heuristic most naturally yields an $L^2$ approximation rather than an $L^\infty$ one. Hope these heuristics help. As a final remark: in the case where $\sigma$ stands for the ReLU activation, a rigorous proof of $L^\infty$ approximation can be carried out without too much work.
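The two-neurons-per-step construction above can be sketched numerically. This is my own illustration, not part of the answer: it assumes the logistic sigmoid and picks the target $g(x) = |x|$ on $[-1,1]$ (Lipschitz with constant $1$) as an example.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp; sigmoid saturates there anyway.
    z = np.clip(z, -50.0, 50.0)
    return 1.0 / (1.0 + np.exp(-z))

def step_approx(x, h, a, b, lam):
    # h*sigma(lam*(x-a)) - h*sigma(lam*(x-b)) ~ h * 1_{[a,b]}(x):
    # two "neurons" per approximate step function.
    return h * sigmoid(lam * (x - a)) - h * sigmoid(lam * (x - b))

def network(x, heights, lefts, rights, lam):
    # Sum of 2n neurons approximating a sum of n step functions.
    return sum(step_approx(x, h, a, b, lam)
               for h, a, b in zip(heights, lefts, rights))

# Target: g(x) = |x| on [-1, 1], approximated by n steps of width 2/n,
# each at the height of g at the step's midpoint.
n = 200
lam = 1e4  # lambda >> 1 sharpens each smoothed step
edges = np.linspace(-1.0, 1.0, n + 1)
lefts, rights = edges[:-1], edges[1:]
heights = np.abs((lefts + rights) / 2)

xs = np.linspace(-0.99, 0.99, 1000)
err = np.max(np.abs(network(xs, heights, lefts, rights, lam) - np.abs(xs)))
```

With step width $2/n = 0.01$ and Lipschitz constant $1$, the piecewise-constant error is about $0.005$, and the sigmoid tails contribute only terms of order $e^{-\lambda/n}$, so `err` stays well below $0.02$.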

ANSWER

This is a consequence of the so-called Universal Approximation Theorem for neural networks; see https://en.m.wikipedia.org/wiki/Universal_approximation_theorem.

This theorem in fact shows that your claim holds for any continuous function - Lipschitz continuity is not needed.

The proof is not entirely trivial. One possible proof is based on the Hahn-Banach theorem.