How to obtain the gradient and optionally Hessian of \begin{align} L(W_1, W_2, W_3) := \sum_{i=1}^N \| W_3 \ g\left(W_2 \ f\left(W_1 x_i \right) \right) - y_i \|_2^2 \ , \end{align} with respect to $W_1$, $W_2$, and $W_3$?
The definition of $x_i \in \mathbb{R}^n$, $W_1 \in \mathbb{R}^{m \times n}$, $W_2 \in \mathbb{R}^{p \times m}$, $W_3 \in \mathbb{R}^{q \times p}$, and $y_i \in \mathbb{R}^q$, and $f(z) = g(z) = \frac{1}{1 + \exp(-z)}$.
can we also generalize for any differentiable $f$ and $g$ functions?
Thank you so much in advance for your help
Define some new vectors $$\eqalign{ p &= W_1x &\implies dp = dW_1\,x \cr f &= \sigma(p) &\implies df = (F-F^2)\,dp \cr r &= W_2f &\implies dr = W_2\,df+dW_2\,f \cr g &= \sigma(r) &\implies dg = (G-G^2)\,dr \cr s &= W_3g-y &\implies ds = W_3\,dg+dW_3\,g \cr }$$ where $F={\rm Diag}(f)$ and $G={\rm Diag}(g)$.
Write the loss function in terms of these new variables. $$\eqalign{ L &= \|s\|^2_F = s:s \cr }$$ where the colon is a convenient product notation for the trace, i.e. $\,A:B = {\rm tr}(A^TB)$
Now calculate the differentials and desired gradients. $$\eqalign{ dL &= 2s:ds \cr &= 2s:(W_3\,dg+dW_3\,g) \cr }$$ Setting $dg=0$ yields our first gradient $$\eqalign{ dL &= 2sg^T:dW_3 \cr \frac{\partial L}{\partial W_3} &= 2sg^T }$$ Now set $dW_3=0$ and continue on towards $W_2$. $$\eqalign{ dL &= 2W_3^Ts:dg \cr &= 2W_3^Ts:(G-G^2)\,dr \cr &= 2(G-G^2)W_3^Ts:(W_2\,df+dW_2\,f) \cr }$$ Setting $df=0$ yields our second gradient $$\eqalign{ dL &= 2(G-G^2)W_3^Tsf^T:dW_2 \cr \frac{\partial L}{\partial W_2} &= 2(G-G^2)W_3^Tsf^T }$$ Now set $dW_2=0$ and continue on towards $W_1$. $$\eqalign{ dL &= 2W_2^T(G-G^2)W_3^Ts:(F-F^2)\,dp \cr &= 2(F-F^2)W_2^T(G-G^2)W_3^Ts:dW_1\,x \cr &= 2(F-F^2)W_2^T(G-G^2)W_3^Tsx^T:dW_1 \cr \frac{\partial L}{\partial W_1} &= 2(F-F^2)W_2^T(G-G^2)W_3^Tsx^T \cr }$$ Actually we've only worked with the $i^{th}$ component of the loss function, i.e. $L_i$.
The full function or gradient is obtained by summing over all $N$ components. $$\eqalign{ L_{total} &= \sum_{i=1}^N L_i \cr \frac{\partial L_{total}}{\partial W_k} &= \sum_{i=1}^N \frac{\partial L_i}{\partial W_k} }$$ NB: In the derivation, $(x, y)$ were treated a single vectors, but in the summation they must be replaced by $(x_i, y_i)$