Is there a mathematically proven way to represent linear transformation L as a chain of non-linear transformations?
If yes, this could solve the problem of Link matrix growing too large (quadratically) as number of dimensions $(N)$ increases. This is important for a neural network called Differentiable Neural Computer, where the Link matrix consumes a lot of computer memory.
Assume a matrix $L \in [0,1]^{N \times N}$ that takes a vector $r \in [0,1]^N$ and produces a vector $d \in [0,1]^N$.
The problem arises when $N$ grows large: the number of entries in $L$ increases quadratically.
Is it possible to represent the transformation $d = L \cdot r$ as a transformation through three smaller matrices? For example
$$d = \sigma( C \cdot \sigma (B \cdot \sigma (A \cdot r)))$$
where:

- $A$ is a $10 \times 1000$ matrix
- $B$ is a $10 \times 10$ matrix
- $C$ is a $1000 \times 10$ matrix
- $\sigma$ is a non-linear transformation, for example a softmax
The trick would be to know how to "decompose" $L$ into $A$, $B$, and $C$ so that together they represent $L$ exactly. So instead of traditional back-propagation, could we use some mathematical formula to immediately calculate the exact values of the entries of $A$, $B$, and $C$?
This way we would represent $L$ using only $10\,000 + 100 + 10\,000 = 20\,100$ entries instead of $1\,000\,000$.
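For concreteness, here is a minimal numpy sketch of the proposed chain; the $10$ and $1000$ are the hypothetical sizes from the question, and the softmax is just one illustrative choice of $\sigma$:

```python
import numpy as np

N, k = 1000, 10  # full dimension and hypothetical bottleneck width
rng = np.random.default_rng(0)

A = rng.random((k, N))   # 10 x 1000
B = rng.random((k, k))   # 10 x 10
C = rng.random((N, k))   # 1000 x 10

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

r = rng.random(N)
d = softmax(C @ softmax(B @ softmax(A @ r)))  # the proposed chain

params_full = N * N                      # 1 000 000 entries in L
params_chain = A.size + B.size + C.size  # 20 100 entries total in A, B, C
```

This shows the memory saving being asked about; whether such a chain can represent every $L$ exactly is the actual question.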
Which area of mathematics should I look into, to learn how it's done?
Just thought of this after glancing at this Stackoverflow answer.
> There is a theoretical result which says that an MLP with only one hidden layer can fit every function of interest up to an arbitrarily low error margin if this hidden layer has enough neurons. However, the number of parameters might be MUCH larger than if you add more layers.
This is impossible. If you truly wish to represent a space of systems with a million parameters, it cannot be done in fewer than a million parameters. Linearity has nothing to do with this result, although it is rather hard to imagine that throwing in non-linear steps is going to help you represent a linear transformation (instead of, as is usually the case, just destroying linearity of the composition). However, this formal mathematical result doesn't say anything about whether it could be practically useful to replace a linear transformation with a series of smaller ones and, indeed, this is a reasonably common feature of machine learning.
The formal reasoning is relatively straightforward given that you're looking at differentiable systems*. Basically, you are trying to find a space $\mathbb R^d$ - considered as the space of parameters for your revised system - and a differentiable, surjective map of that space into $\mathbb R^{10^6}$. It doesn't matter how you try to do this: you can't succeed unless $d \geq 10^6$, simply because otherwise the $10^6$-dimensional volume of the image is $0$. You might be able to find maps with desirable properties (e.g. the image can approximate anything reasonably well, under some criteria), but "hit everything" is not possible.
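The dimension count is easiest to see in the purely linear case: whatever $A$, $B$, $C$ you pick, the composite $C B A$ has rank at most $10$, so its image is at most a $10$-dimensional subspace of $\mathbb R^{1000}$ rather than all of it. A quick numpy check, using the shapes from the question:

```python
import numpy as np

N, k = 1000, 10
rng = np.random.default_rng(1)
A = rng.random((k, N))   # 10 x 1000
B = rng.random((k, k))   # 10 x 10
C = rng.random((N, k))   # 1000 x 10

# The best a purely linear chain can achieve: rank(C @ B @ A) <= k.
M = C @ B @ A
print(np.linalg.matrix_rank(M))  # at most k = 10, far below N = 1000
```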
(*Of course, if you only care about continuity, there are space-filling curves, but that is an abstract result about the unit interval and does not apply to finite-precision arithmetic, even if we ignore that we really want differentiability.)