If linear functions shouldn't be used as activation functions, why can polynomials?


In my understanding, in a neural network one aims to approximate a function with something like $$f_1(W_1 f_2( W_2 f_3(W_3 x))) = y(x)$$ and if $f_1, f_2$ and $f_3$ are linear, they can all be written as a single $f$, and we instead have something like $f(W_1W_2W_3 x) = y(x)$. Since $W_1W_2W_3$ is simply some matrix $W$, this can be written as a linear classifier $f(W x) = y(x)$. In neural-network terms, this amounts to collapsing the network into just an input and an output layer, with only one weight matrix to optimize.
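The collapse described above is easy to verify numerically. The sketch below (layer sizes are arbitrary, chosen only for illustration) checks that applying three linear layers in sequence gives exactly the same result as applying the single collapsed matrix $W = W_1 W_2 W_3$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" with purely linear activations (each f_i = identity);
# the sizes are hypothetical, chosen just for illustration.
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((5, 6))
W3 = rng.standard_normal((6, 3))
x = rng.standard_normal(3)

# Applying the layers one after another...
deep = W1 @ (W2 @ (W3 @ x))

# ...is identical to applying the single collapsed matrix W = W1 W2 W3.
W = W1 @ W2 @ W3
single = W @ x

assert np.allclose(deep, single)
```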

As far as I can tell, since we aren't really "building" a function out of a sum the way a Fourier series does, the only motivation I can give for not wanting $f_1, f_2, f_3$ to collapse into one function $f$ is that we don't want our $3$ weight matrices to collapse into one: having more weight matrices to optimize gives us more degrees of freedom, which is presumably a good thing (much as allowing more and more Fourier coefficients to be nontrivial can only help, up to a point, when trying to approximate a function).

If $f_1, f_2, f_3$ are non-linear, they can no longer be combined into one $f$. For instance, if $f_1(x) = 2\sin(x)$ and $f_2(x) = 3\sin(23x)$, then $f_1 + f_2$ cannot be rewritten as a single sinusoid $a\sin(bx)$: the two functions cannot be collapsed into one, as they are not linear.

However, polynomials such as $f_1 = x^2 + 4$ and $f_2 = 3 x^2 - 1$ are non-linear and yet can be combined into a single function $f = f_1 + f_2 = (1+3)x^2 + (4-1) = 4x^2 + 3$.
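A quick numerical check of this combination:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 101)
f1 = x**2 + 4
f2 = 3 * x**2 - 1

# The sum collapses into the single polynomial (1 + 3) x^2 + (4 - 1).
combined = 4 * x**2 + 3
assert np.allclose(f1 + f2, combined)
```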

So are all non-linear functions created equal in this case? Or, when choosing a non-linear function for a neural network (and disregarding anything else about what output range and domain we want) do we need to make sure we choose functions that cannot be simply combined into one function?

1 Answer

Polynomials are terrible activation functions for neural networks, precisely for the reason you cite.

Fix a continuous activation function $\sigma : \mathbb{R} \to \mathbb{R}$. We define the space of (single-layer) neural networks activated by $\sigma$ by $$\mathcal{N}_\sigma = \left\lbrace f(x) = \sum_{j=1}^N \alpha_j \, \sigma (y_j^T x + \theta_j) \;\middle|\; \alpha_j, \theta_j \in \mathbb{R},\ y_j \in \mathbb{R}^d,\ N \in \mathbb{Z}_{\geq 1} \right\rbrace$$ We can study this set as a subspace of $C([0,1]^d)$ with the uniform norm. For neural networks to be useful, it is reasonable to require that they can approximate continuous functions, so we may ask that $\overline{\mathcal{N}_\sigma} = C([0,1]^d)$. In the literature, people call this density property a "universal approximation theorem", or $\mathcal{N}_\sigma$ a class of "universal approximators".
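This definition can be made concrete with a small experiment. The sketch below (with $d = 1$, and with the choices of target, knot grid, and feature count purely illustrative) builds ReLU features of the form $\sigma(y_j x + \theta_j)$ and fits outer coefficients $\alpha_j$ by least squares; the resulting element of $\mathcal{N}_\sigma$ approximates a smooth target closely:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Target: a continuous function on [0, 1] (taking d = 1 for simplicity).
xs = np.linspace(0.0, 1.0, 200)
target = np.sin(2 * np.pi * xs)

# Hinge features sigma(y_j x + theta_j), mostly with y_j = 1 and a grid of
# biases theta_j = -knot (the knot grid is an illustrative choice).
knots = np.linspace(-0.2, 0.95, 24)
features = relu(xs[:, None] - knots)
# One extra feature with y_j = -1 so affine functions are in the span.
features = np.hstack([features, relu(1.0 - xs)[:, None]])

# Fit the outer coefficients alpha_j by least squares.
alpha, *_ = np.linalg.lstsq(features, target, rcond=None)
approx = features @ alpha

# Uniform-norm error of this member of N_sigma against the target.
max_err = np.max(np.abs(approx - target))
```

With a finer knot grid (larger $N$) the error can be driven as small as desired, which is exactly what density in the uniform norm asserts.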

If $\sigma$ is a polynomial of degree $k$, then every element of $\mathcal{N}_\sigma$ is a polynomial of degree at most $k$, so $\mathcal{N}_\sigma$ lies in a finite-dimensional subspace of $C([0,1]^d)$ and cannot be dense! This remains true if you extend $\mathcal{N}_\sigma$ to have multiple layers: composing polynomials yields polynomials, and for any fixed number of layers the degree stays bounded, so the space is still finite-dimensional. Thus, polynomials are not good activation functions.

For a brief discussion on when $\overline{\mathcal{N}_\sigma} = C[0,1]^d$ holds, see, for instance, Non-trivial examples of non-discriminatory functions. Additionally, I recommend reading Universal Approximation Theorem — Neural Networks for more intuition behind the universal approximation result.