I am trying to understand how the author counts the width and depth of the ReLU feedforward neural networks (FNNs) implemented in the proof. In this paper, the width of a ReLU FNN is defined as the maximum width over its hidden layers, and the depth as the number of hidden layers. I have attached an image of the proof below.
Now, I understand that $\max(x_1,x_2)$ as implemented in the proof has depth $1$, since the activation function $\sigma$ is applied once, and width $4$, since the single hidden layer consists of the four neurons $\sigma(x_1+x_2), \sigma(-x_1-x_2), \sigma(x_1-x_2), \sigma(-x_1+x_2)$.
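As a sanity check of my reading, here is a small numerical sketch of this width-$4$, depth-$1$ network (the function name `max2` and the use of NumPy are my own choices, not from the paper):

```python
import numpy as np

def relu(z):
    """ReLU activation sigma, applied elementwise."""
    return np.maximum(z, 0.0)

def max2(x1, x2):
    # single hidden layer of 4 ReLU neurons
    h = relu(np.array([x1 + x2, -x1 - x2, x1 - x2, -x1 + x2]))
    # linear output layer with weights (1/2, -1/2, 1/2, 1/2)
    return 0.5 * h[0] - 0.5 * h[1] + 0.5 * h[2] + 0.5 * h[3]
```

For example, `max2(3.0, -1.0)` agrees with `max(3.0, -1.0)`, including for negative inputs, where the $|x_1-x_2|$ pair is what makes the formula exact.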
Then when implementing $\max(x_1,x_2,x_3)=\max(\max(x_1,x_2), x_3)$ we obtain
\begin{align} \max(x_1,x_2,x_3)&=\frac{1}{2}\sigma\left(\frac{1}{2}\sigma(x_1+x_2)-\frac{1}{2}\sigma(-x_1-x_2)+\frac{1}{2}\sigma(x_1-x_2)+\frac{1}{2}\sigma(-x_1+x_2)+\sigma(x_3)-\sigma(-x_3)\right)\\ &\phantom{=}-\frac{1}{2}\sigma\left(-\frac{1}{2}\sigma(x_1+x_2)+\frac{1}{2}\sigma(-x_1-x_2)-\frac{1}{2}\sigma(x_1-x_2)-\frac{1}{2}\sigma(-x_1+x_2)-\sigma(x_3)+\sigma(-x_3)\right)\\ &\phantom{=}+\frac{1}{2}\sigma\left(\frac{1}{2}\sigma(x_1+x_2)-\frac{1}{2}\sigma(-x_1-x_2)+\frac{1}{2}\sigma(x_1-x_2)+\frac{1}{2}\sigma(-x_1+x_2)-\sigma(x_3)+\sigma(-x_3)\right)\\ &\phantom{=}+\frac{1}{2}\sigma\left(-\frac{1}{2}\sigma(x_1+x_2)+\frac{1}{2}\sigma(-x_1-x_2)-\frac{1}{2}\sigma(x_1-x_2)-\frac{1}{2}\sigma(-x_1+x_2)+\sigma(x_3)-\sigma(-x_3)\right)\\ \end{align}
and so this is depth $2$, since the activation function is applied twice, and width $6$, since the first hidden layer has six neurons (the second has only four), whose outputs are combined linearly before the activation function is applied again.
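To double-check this depth-$2$, width-$6$ reading, here is a numerical sketch of the two-layer composition $\max(\max(x_1,x_2),x_3)$ (function names are mine; `relu` stands for $\sigma$):

```python
import numpy as np

def relu(z):
    """ReLU activation sigma, applied elementwise."""
    return np.maximum(z, 0.0)

def max3(x1, x2, x3):
    # first hidden layer: 6 ReLU neurons
    h1 = relu(np.array([x1 + x2, -x1 - x2, x1 - x2, -x1 + x2, x3, -x3]))
    # m = max(x1, x2) and y = x3, both linear in h1
    m = 0.5 * h1[0] - 0.5 * h1[1] + 0.5 * h1[2] + 0.5 * h1[3]
    y = h1[4] - h1[5]
    # second hidden layer: 4 ReLU neurons, then a linear output
    h2 = relu(np.array([m + y, -m - y, m - y, -m + y]))
    return 0.5 * h2[0] - 0.5 * h2[1] + 0.5 * h2[2] + 0.5 * h2[3]
```

The maximum layer width is $\max(6,4)=6$, and $\sigma$ is applied twice along every path from input to output, matching the width-$6$, depth-$2$ count.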
Now, I am confused by the immediate conclusion that $\mathrm{mid}(x_1,x_2,x_3)$, as defined in the proof, has width $14$.
I believe it would be width $14$ if we implemented $\min(x_1,x_2,x_3)$ by first implementing $\min(x_2,x_3)=\frac{x_2+x_3-|x_2-x_3|}{2}$ as
$$\min(x_2,x_3)=\frac{1}{2}\sigma(x_2+x_3)-\frac{1}{2}\sigma(-x_2-x_3)-\frac{1}{2}\sigma(x_2-x_3)-\frac{1}{2}\sigma(-x_2+x_3)$$
and then implementing $\min(x_1, x_2, x_3)=\min(x_1,\min(x_2,x_3))$ analogously to the way we did for $\max$ above. This is because the first hidden layer would have neurons $$\sigma(x_1+x_2), \sigma(-x_1-x_2), \sigma(x_1-x_2), \sigma(-x_1+x_2),$$ $$\sigma(x_2+x_3), \sigma(-x_2-x_3),\sigma(x_2-x_3),\sigma(-x_2+x_3),$$ $$\sigma(x_1), \sigma(-x_1), \sigma(x_2), \sigma(-x_2), \sigma(x_3), \sigma(-x_3),$$ where the $\sigma(x_2), \sigma(-x_2)$ would come in by implementing
$$\sigma(\pm x_1 \pm x_2\pm x_3)=\sigma(\pm\sigma(x_1)\mp\sigma(-x_1)\pm\sigma(x_2)\mp\sigma(-x_2)\pm\sigma(x_3)\mp\sigma(-x_3)).$$
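To convince myself that these building blocks behave as claimed, here is a quick numerical check of the $\min$ pair formula and of the passthrough identity $\sigma(x)-\sigma(-x)=x$ used above (function names are mine):

```python
import numpy as np

def relu(z):
    """ReLU activation sigma, applied elementwise."""
    return np.maximum(z, 0.0)

def min2(x1, x2):
    # single hidden layer of 4 ReLU neurons,
    # output weights (1/2, -1/2, -1/2, -1/2)
    h = relu(np.array([x1 + x2, -x1 - x2, x1 - x2, -x1 + x2]))
    return 0.5 * h[0] - 0.5 * h[1] - 0.5 * h[2] - 0.5 * h[3]

def passthrough(x):
    # sigma(x) - sigma(-x) = x: the neuron pair that carries an
    # input unchanged through a hidden layer
    return relu(x) - relu(-x)
```

Note that `min2` uses exactly the same four hidden neurons as the $\max$ pair, only with different output weights, which is what the width question below turns on.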
However, if we instead implemented $\min(x_1, x_2, x_3)$ as
However, if we instead implemented $\min(x_1,x_2,x_3)=\min(\min(x_1,x_2),x_3)$, starting from $$\min(x_1,x_2)=\frac{1}{2}\sigma(x_1+x_2)-\frac{1}{2}\sigma(-x_1-x_2)-\frac{1}{2}\sigma(x_1-x_2)-\frac{1}{2}\sigma(-x_1+x_2),$$ would the resulting $\mathrm{mid}(x_1,x_2,x_3)$ not have width $10$? In that case the first hidden layer would only need the neurons
$$\sigma(x_1+x_2), \sigma(-x_1-x_2), \sigma(x_1-x_2), \sigma(-x_1+x_2),$$ $$\sigma(x_1), \sigma(-x_1), \sigma(x_2), \sigma(-x_2), \sigma(x_3), \sigma(-x_3).$$
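To test whether this shared-first-layer arrangement really computes $\mathrm{mid}$ with width $10$, I sketched it numerically, assuming (as I read the proof) that $\mathrm{mid}(x_1,x_2,x_3)=x_1+x_2+x_3-\max(x_1,x_2,x_3)-\min(x_1,x_2,x_3)$; the function names are mine:

```python
import numpy as np

def relu(z):
    """ReLU activation sigma, applied elementwise."""
    return np.maximum(z, 0.0)

def mid3(x1, x2, x3):
    # shared first hidden layer: 10 ReLU neurons (the 4 pair neurons
    # serve both max(x1,x2) and min(x1,x2))
    h1 = relu(np.array([x1 + x2, -x1 - x2, x1 - x2, -x1 + x2,
                        x1, -x1, x2, -x2, x3, -x3]))
    a = 0.5 * h1[0] - 0.5 * h1[1] + 0.5 * h1[2] + 0.5 * h1[3]  # max(x1,x2)
    b = 0.5 * h1[0] - 0.5 * h1[1] - 0.5 * h1[2] - 0.5 * h1[3]  # min(x1,x2)
    y3 = h1[8] - h1[9]                                         # x3
    s = (h1[4] - h1[5]) + (h1[6] - h1[7]) + y3                 # x1+x2+x3
    # second hidden layer: 4 (max) + 4 (min) + 2 (carry the sum) = 10
    h2 = relu(np.array([a + y3, -a - y3, a - y3, -a + y3,
                        b + y3, -b - y3, b - y3, -b + y3,
                        s, -s]))
    mx3 = 0.5 * h2[0] - 0.5 * h2[1] + 0.5 * h2[2] + 0.5 * h2[3]  # max of all 3
    mn3 = 0.5 * h2[4] - 0.5 * h2[5] - 0.5 * h2[6] - 0.5 * h2[7]  # min of all 3
    return (h2[8] - h2[9]) - mx3 - mn3                           # sum - max - min
```

Under this assumption both hidden layers come out to $10$ neurons, which is why I expected width $10$ rather than $14$.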
I am trying to make sure that my understanding of counting width/depth is correct, or whether I am overcomplicating the computation, and also trying to understand whether there would be some benefit to showing it can be implemented with width $14$ versus width $10$. I have not gotten to the proofs of the main theorems of the paper yet, as I am still trying to understand the basics of the lemmas first, so maybe there is a benefit I haven't seen yet.
