Can every continuous piecewise linear function $[-1,1]^k \rightarrow \mathbb{R}^n$ be written as a composition of the following building blocks:
- Affine map: $x \mapsto Ax + b$ for some matrix $A$ and vector $b$
- ReLU activation: $(x_1, x_2, \dots) \mapsto (\max(0, x_1), \max(0, x_2), \dots)$
If so, how many composition factors are needed? Can every such function be represented by a network with "only one hidden layer":
$$ \text{affine} \circ \text{relu} \circ \text{affine} \circ \text{relu} \circ \text{affine} $$
By piecewise linear, I mean that there exists a decomposition of the domain $[-1,1]^k$ into finitely many polytopes such that the restriction of the function to each polytope is affine.
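To make the building blocks concrete, here is a minimal sketch (using hypothetical weight matrices chosen by hand) of a one-hidden-layer network of the form affine ∘ ReLU ∘ affine computing the piecewise linear function $|x| = \max(0,x) + \max(0,-x)$:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def abs_via_relu(x):
    """One-hidden-layer ReLU network computing |x|."""
    A1 = np.array([[1.0], [-1.0]])   # first affine map: x -> (x, -x)
    h = relu(A1 @ np.array([x]))     # hidden layer: (relu(x), relu(-x))
    A2 = np.array([[1.0, 1.0]])      # second affine map: sum the two units
    return (A2 @ h)[0]

xs = np.linspace(-1, 1, 11)
assert all(abs(abs_via_relu(x) - abs(x)) < 1e-12 for x in xs)
```

The question is whether this pattern, possibly with more layers, suffices for every continuous piecewise linear function.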
Long version of my short comment:
First of all, not every continuous piecewise affine function can be built by a ReLU neural network with only one hidden layer. The reason is that a compactly supported piecewise affine function, such as $$ \mathbb{R}^d \ni x \mapsto \max\{0, 1 - \max_{i=1, \dots, d} |x_i| \} $$ for $d \geq 2$, cannot be represented by a sum of ReLUs of affine functions. The short reason is that this function is affine (identically zero) outside a compact set, whereas a sum of ReLUs is either globally affine or fails to be smooth along at least one entire hyperplane, and such a hyperplane extends to infinity. (This is of course something one would need to prove in more detail; a proof can be found in Theorem 4.1 of https://arxiv.org/pdf/1807.03973.pdf.)
On the other hand, the same paper, https://arxiv.org/pdf/1807.03973.pdf, shows that deep ReLU neural networks can represent linear finite elements, because the corresponding hat functions can be written as a combination of max and min operations. I can only do a worse job than the authors themselves at explaining how this is done, and their paper has a lot of nice illustrations, so I think it is best to refer directly to Chapter 3 of that work.
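The key identity behind the max/min constructions is that $\max(a,b) = a + \mathrm{relu}(b-a)$ (and $\min(a,b) = a - \mathrm{relu}(a-b)$), which costs one extra ReLU layer per pairwise max. As a rough illustration (my own sketch, not the construction from the paper), the two-dimensional pyramid $\max(0, 1 - \max(|x|,|y|))$ can be computed by a ReLU network with three hidden layers:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def max_via_relu(a, b):
    # max(a, b) = a + relu(b - a): one extra ReLU layer on top of a and b
    return a + relu(b - a)

def pyramid_net(x, y):
    # hidden layer 1: relu of (x, -x, y, -y); affine combine to get |x|, |y|
    ax = relu(x) + relu(-x)          # |x|
    ay = relu(y) + relu(-y)          # |y|
    # hidden layer 2: max(|x|, |y|) via one more ReLU
    m = max_via_relu(ax, ay)
    # hidden layer 3: clip 1 - m from below at 0
    return relu(1.0 - m)

for x, y in [(0.0, 0.0), (0.5, -0.3), (2.0, 0.1), (-0.9, 0.9)]:
    assert abs(pyramid_net(x, y) - max(0.0, 1.0 - max(abs(x), abs(y)))) < 1e-12
```

So the function that obstructs one hidden layer becomes representable once depth is allowed, which is exactly the point of the paper.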
From the construction of hat functions, it follows essentially directly that all continuous piecewise linear functions can be represented by ReLU neural networks, since every such function is a sum of hat functions. This is Theorem 5.2 of the work cited above.
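In one dimension the sum decomposition can be written down explicitly without any hat-function machinery: a continuous piecewise linear function with breakpoints $t_1 < \dots < t_m$ and slopes $s_0, \dots, s_m$ satisfies $f(x) = f(t_0) + s_0 (x - t_0) + \sum_i (s_i - s_{i-1})\,\mathrm{relu}(x - t_i)$. A small sketch with hypothetical breakpoints and slopes:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

breaks = [-0.5, 0.0, 0.5]         # hypothetical breakpoints
slopes = [1.0, -2.0, 0.5, 3.0]    # slope on each of the four pieces

def f_sum_of_relus(x, x0=-1.0, y0=0.0):
    """Evaluate the piecewise linear function anchored at (x0, y0)
    as an affine part plus a sum of ReLUs, one per breakpoint."""
    y = y0 + slopes[0] * (x - x0)
    for t, s_old, s_new in zip(breaks, slopes[:-1], slopes[1:]):
        y += (s_new - s_old) * relu(x - t)   # slope change at breakpoint t
    return y

# spot-check against values obtained by walking the pieces by hand
assert abs(f_sum_of_relus(-1.0) - 0.0) < 1e-12
assert abs(f_sum_of_relus(0.0) - (-0.5)) < 1e-12
assert abs(f_sum_of_relus(1.0) - 1.25) < 1e-12
```

This is exactly a one-hidden-layer network, which matches the fact that the depth obstruction above only appears for input dimension $d \geq 2$.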