How do details of the proof of the universal approximation theorem work?


I am working through a proof of the universal approximation theorem and need to understand how it works. This is the theorem:

Let $d \in \mathbb{N}$, let $K \subseteq \mathbb{R}^d$ be compact, and let $\varrho \in L_{loc}^{\infty}(\mathbb{R})$ be an activation function such that the closure of the points of discontinuity of $\varrho$ is a Lebesgue null set. Further let \begin{equation*} \tilde{ \mathcal{F}} = \bigcup\limits_{n \in \mathbb{N}} \mathcal{F}_{((d, n, 1), \varrho)} \end{equation*} be the corresponding set of two-layer NN realizations. Then it holds that $C(K) \subset cl(\tilde{ \mathcal{F}})$ (where the closure is taken with respect to the topology induced by the $L^{\infty}(K)$-norm) if and only if there does not exist a polynomial $p: \mathbb{R} \rightarrow \mathbb{R}$ with $p=\varrho$ almost everywhere.
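To get a feel for the objects involved, here is a tiny concrete member of $\mathcal{F}_{((1,2,1), \varrho)}$ with $\varrho = \mathrm{ReLU}$ (my own illustration, not from the lecture notes; ReLU satisfies the hypotheses, being continuous, locally bounded, and not a.e. equal to a polynomial):

```python
import numpy as np

# A single two-layer realization W @ rho(Z x + a) + b with d = 1 input
# and n = 2 hidden neurons.  With the weights below it equals |x|
# exactly, since |x| = relu(x) + relu(-x).
relu = lambda t: np.maximum(t, 0.0)

Z = np.array([[1.0], [-1.0]])  # first-layer weights (n x d)
a = np.zeros(2)                # first-layer bias
W = np.array([[1.0, 1.0]])     # second-layer weights (1 x n)
b = 0.0                        # second-layer bias

def phi(x):
    x = np.atleast_1d(x)
    return float(W @ relu(Z @ x + a) + b)
```

So even this two-neuron network already realizes a non-polynomial continuous function on any compact $K \subseteq \mathbb{R}$.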

Now the proof states that the Hahn-Banach theorem gives the following equivalence: $\tilde{ \mathcal{F}}$ is dense in a real normed vector space $S$ if and only if for every nontrivial functional $F$ in the topological dual space $S'$ there exist $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that $F(\varrho(\langle w, \cdot\rangle + b)) \neq 0$. But I currently don't see how the Hahn-Banach theorem can give us such an equivalence. Does someone have an idea how this works and could give me some input so I can understand it? Thank you very much!

Edit: I have new information: for a normed space $X$ and a subspace $Y \subseteq X$ the following are equivalent:

  1. Y is dense in X
  2. $\forall x' \in X'$, $x'=0$ on Y implies $x'=0$

So if we simply negate the second part of the equivalence, we get what we need: if $Y$ is not dense, then there exists a nontrivial functional that vanishes on all of $Y$; equivalently, $Y$ is dense exactly when every nontrivial functional $x'$ is nonzero at some point of $Y$. For the UAT this means we can find such a neural network, but why can it be represented with only these parameters $w$ and $b$? Don't we need another linear transformation, or even more dimensions?
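For reference, here is how I understand the Hahn-Banach step itself (my own sketch, glossing over the exact ambient space). If $Y$ is not dense in $X$, pick $x_0 \in X \setminus \overline{Y}$; since $\overline{Y}$ is a closed subspace, a standard corollary of the Hahn-Banach theorem yields $F \in X'$ with $F|_{\overline{Y}} = 0$ and $F(x_0) = \operatorname{dist}(x_0, Y) > 0$, so $F$ is nontrivial but vanishes on $Y$. The contrapositive is exactly the direction $2 \Rightarrow 1$ above: \begin{equation*} \bigl(\forall x' \in X': \ x'|_Y = 0 \implies x' = 0\bigr) \implies \overline{Y} = X. \end{equation*} Applied with $X = C(K)$ (with the $L^{\infty}(K)$-norm) and $Y$ the linear span of the single-neuron functions $x \mapsto \varrho(\langle w, x\rangle + b)$, density of $\tilde{\mathcal{F}}$ follows once every nontrivial $F \in C(K)'$ is nonzero on some $\varrho(\langle w, \cdot\rangle + b)$.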

Edit 2: Further definitions, so you don't need any background in neural networks.

$\tilde{\mathcal{F}}_a = \{ \Phi_a(\cdot, \theta) : \theta \in \mathbb{R}^{P(N)} \}$

\begin{align*} &\Phi^{(1)}(x, \theta)=W^{(1)}x + b^{(1)}, \\ &\bar{\Phi}^{(l)}(x, \theta)=\tilde{\varrho}(\Phi^{(l)}(x, \theta)), \hspace{5.2em} l \in \{1, \dots, L-1\} \text{ and } \\ &\Phi^{(l+1)}(x, \theta)=W^{(l+1)}\bar{\Phi}^{(l)}(x, \theta) + b^{(l+1)}, \quad l \in \{1, \dots, L-1\}, \end{align*} where $\tilde{\varrho}$ denotes $\varrho$ applied componentwise.

This means in our case the functions look like: $\Phi_a(x, \theta)=W\tilde{\varrho}\bigl(Zx+a\bigr)+b$
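A minimal NumPy sketch of this realization (toy values and names are my own; I assume $\varrho = \mathrm{ReLU}$ for concreteness):

```python
import numpy as np

# Phi_a(x, theta) = W rho~(Z x + a) + b with d = 2 inputs and
# n = 3 hidden neurons; rho~ applies rho componentwise.
rho = lambda t: np.maximum(t, 0.0)          # ReLU, applied elementwise

Z = np.array([[1.0, 0.0],                   # first-layer weights (n x d)
              [0.0, 1.0],
              [1.0, 1.0]])
a = np.array([0.0, 0.0, -1.0])              # first-layer bias (n,)
W = np.array([[1.0, -1.0, 2.0]])            # second-layer weights (1 x n)
b = np.array([0.5])                         # second-layer bias (1,)

def phi_a(x):
    return W @ rho(Z @ x + a) + b           # shape (1,): scalar output

x = np.array([0.5, -1.0])
y = phi_a(x)
```

The parameter vector $\theta$ simply collects all entries of $Z, a, W, b$.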

Best answer:

Okay, so I think I found the solution: assume $f \in \mathcal{\tilde{F}}_a$, so the function looks like $f(x)=\Phi_a(x, \theta)=W\tilde{\varrho}\bigl(Zx+a\bigr)+b$. By writing out the matrix products (with $Z_i$ the rows of $Z$ and $W_i$ the entries of $W$), we can rewrite it: $f(x)= W_1\varrho \bigl(\langle Z_1, x \rangle + a_1\bigr)+ \ \dots \ +W_n\varrho \bigl(\langle Z_n, x \rangle + a_n\bigr) + b$.
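A quick numeric sanity check of this rewriting step (my own toy values), confirming that the matrix form and the per-neuron sum agree:

```python
import numpy as np

# W rho(Z x + a) + b  ==  sum_i W_i rho(<Z_i, x> + a_i) + b,
# where Z_i are the rows of Z and W_i the entries of W.
rho = lambda t: np.maximum(t, 0.0)          # ReLU as an example activation

Z = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
a = np.array([0.1, -0.2, 0.3])
W = np.array([1.0, -1.0, 0.5])              # entries W_i
b = 2.0
x = np.array([0.4, -0.7])

matrix_form = W @ rho(Z @ x + a) + b
sum_form = sum(W[i] * rho(Z[i] @ x + a[i]) for i in range(3)) + b
# the two agree up to floating-point rounding
```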

Now suppose $F(f) \neq 0$. By the linearity of $F$, at least one summand must have nonzero image under $F$: either $F\bigl(\varrho(\langle Z_i, \cdot \rangle + a_i)\bigr) \neq 0$ for some $i$, or $F(b) \neq 0$ for the constant term $b$.

But the nonzero term must be one of the $\varrho\bigl(\langle Z_i, \cdot \rangle + a_i\bigr)$, since otherwise $F$ would vanish on every function of that form regardless of the parameters, which would force our activation function $\varrho$ to be identically zero. (Recall that we are free to choose all the parameters.)

Therefore we obtain parameters $w = Z_i$ and $b = a_i$ such that $F(\varrho(\langle w, \cdot\rangle + b)) \neq 0$.