Universal approximation of neural networks


I am currently working through the topic of reproducing kernel Hilbert spaces (RKHS) following the draft book by Francis Bach.

As background for my question, define: \begin{align} &H_1=\lbrace f: \mathbb{R}^d \rightarrow \mathbb{R} \mid f(x)=\int\limits_{\mathbb{R}^{d+1}} \eta(w,b)\,\sigma(w^Tx+b)\, d\tau(w,b) \quad \text{for some } \eta \in L^1(d\tau)\rbrace \\ &\gamma_1(f):=\Vert f \Vert_{H_1} = \inf \int\limits_{\mathbb{R}^{d+1}} |\eta(w,b)|\, d\tau(w,b) \qquad \text{s.t.} \qquad f(x)=\int\limits_{\mathbb{R}^{d+1}} \eta(w,b)\,\sigma(w^Tx+b)\, d\tau(w,b), \end{align} where the integrals run over the $(w,b) \in \mathbb{R}^d \times \mathbb{R}$ and $\tau$ is a fixed probability measure. This is also called the variation norm.

We then want to approximate a function $g$ satisfying $\gamma_1(g)<\infty$ by a 2-layer neural network $$f(x)=\sum\limits_{j=1}^m \eta_j \sigma(w_j^Tx+b_j).$$
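To make the setup concrete, here is a minimal sketch of such a finite-width network with the input weights $(w_j, b_j)$ sampled once and then frozen; the choices $\sigma = \mathrm{ReLU}$ and Gaussian $\tau$ are my own assumptions for illustration, not taken from the book:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_net(x, eta, W, b, sigma=relu):
    """Finite-width network f(x) = sum_j eta_j * sigma(w_j^T x + b_j).

    x:   (n, d) batch of inputs
    eta: (m,)   output weights (the only free parameters once
                the input weights are fixed)
    W:   (m, d) input weights w_j, sampled from tau and frozen
    b:   (m,)   biases b_j, sampled from tau and frozen
    """
    return sigma(x @ W.T + b) @ eta

rng = np.random.default_rng(0)
d, m, n = 3, 50, 10
W = rng.standard_normal((m, d))   # fixed input weights (Gaussian tau assumed)
b = rng.standard_normal(m)
eta = rng.standard_normal(m) / m
x = rng.standard_normal((n, d))
y = two_layer_net(x, eta, W, b)
print(y.shape)  # (10,)
```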

Then the author claims in section 9.3.5. "If the input weights are fixed, then the bound on $\gamma_1(g)$ translates to a bound $\Vert \eta \Vert_1 \leq \gamma_1(g)$."

Now I understand that $\Vert \eta \Vert_1 = \gamma_1(f)$, but this seems like too little explanation for me to understand why we impose $\gamma_1(f)\leq \gamma_1(g)$. This appears to be an essential condition on which further computations and algorithmic results rely. So my question is: why do we impose $\gamma_1(f) \leq \gamma_1(g)$ when approximating a function $g$ by a 2-layer neural network $f$?
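For what it's worth, here is a small numerical sketch of the constrained problem I have in mind (the target, the bound $C$ standing in for $\gamma_1(g)$, and the ReLU/Gaussian choices are all my own illustrative assumptions, not from the book): with the input weights frozen, fitting the output weights subject to $\Vert \eta \Vert_1 \leq \gamma_1(g)$ is least squares over an $\ell_1$-ball, which can be solved e.g. by projected gradient descent.

```python
import numpy as np

def project_l1(v, C):
    """Euclidean projection of v onto the l1-ball {u : ||u||_1 <= C}."""
    if np.abs(v).sum() <= C:
        return v
    u = np.sort(np.abs(v))[::-1]          # sorted magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > cssv - C)[0][-1]
    theta = (cssv[rho] - C) / (rho + 1.0) # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(1)
d, m, n = 2, 100, 200
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))           # fixed input weights
b = rng.standard_normal(m)                # fixed biases
Phi = np.maximum(X @ W.T + b, 0.0)        # frozen ReLU feature matrix
y = np.sin(X[:, 0])                       # toy target g evaluated at the data

C = 5.0                                   # stands in for gamma_1(g)
eta = np.zeros(m)
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2    # step size 1 / ||Phi||_op^2
for _ in range(500):
    eta = project_l1(eta - lr * Phi.T @ (Phi @ eta - y), C)

print(np.abs(eta).sum())                  # <= C by construction
```

The projection step is what enforces $\Vert \eta \Vert_1 \leq C$ at every iterate, so the learned network stays inside the variation-norm ball of radius $C$ throughout.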

Thank you very much already in advance!