Let $f_{\Theta}: \mathbb{R}^{2} \rightarrow \mathbb{R}^{2}$ be a neural network such that $$ f_{\Theta}(\mathbf{x})=W^{(2)} \sigma\left(W^{(1)} \mathbf{x}+\mathbf{b}^{(1)}\right)+\mathbf{b}^{(2)} $$ where $$ W^{(1)} \in \mathbb{R}^{3 \times 2}, W^{(2)} \in \mathbb{R}^{2 \times 3}, \mathbf{b}^{(1)} \in \mathbb{R}^{3}, \mathbf{b}^{(2)} \in \mathbb{R}^{2} $$ and $\sigma$ is the ReLU function. Suppose the parameters $\Theta=\left\{W^{(1)}, W^{(2)}, \mathbf{b}^{(1)}, \mathbf{b}^{(2)}\right\}$ are initialized as $$ W^{(1)}=\left[\begin{array}{ccc} 1 & 0 & -1 \\ -1 & -1 & 1 \end{array}\right]^{\top}, W^{(2)}=\left[\begin{array}{ccc} 0 & -2 & 1 \\ 1 & -1 & -1 \end{array}\right], \mathbf{b}^{(1)}=\left[\begin{array}{lll} 0 & 0 & 1 \end{array}\right]^{\top}, \mathbf{b}^{(2)}=\left[\begin{array}{ll} 1 & 0 \end{array}\right]^{\top} . $$ To minimize the $L^{2}$ loss $\ell(\Theta)=\frac{1}{2}\left\|\mathbf{y}-f_{\Theta}(\mathbf{x})\right\|^{2}$, we use gradient descent with learning rate $\gamma=1$. How can we compute $\Theta$ after two iterations of this optimization, given the training data $$ \mathcal{D}=\left\{\left(\left[\begin{array}{ll} 2 & -3 \end{array}\right]^{\top},\left[\begin{array}{ll} -4 & 0 \end{array}\right]^{\top}\right)\right\} ? $$
If the gradients can be evaluated, the update step itself is straightforward. But what is the principle for computing the gradients with respect to the weight matrices?
The algorithm for computing these gradients is called backpropagation, an instance of reverse-mode automatic differentiation. One computes $\frac{\partial \ell}{\partial W^{(k)}}$ layer by layer, from the output back to the input, using the chain rule. Concretely, writing $\mathbf{z}^{(1)}=W^{(1)}\mathbf{x}+\mathbf{b}^{(1)}$, $\mathbf{h}=\sigma(\mathbf{z}^{(1)})$, $\hat{\mathbf{y}}=W^{(2)}\mathbf{h}+\mathbf{b}^{(2)}$, and $\mathbf{e}=\hat{\mathbf{y}}-\mathbf{y}$, the chain rule gives $$ \frac{\partial \ell}{\partial \mathbf{b}^{(2)}}=\mathbf{e}, \quad \frac{\partial \ell}{\partial W^{(2)}}=\mathbf{e}\,\mathbf{h}^{\top}, \quad \boldsymbol{\delta}=\left(W^{(2)}\right)^{\top}\mathbf{e} \odot \mathbb{1}\left[\mathbf{z}^{(1)}>0\right], \quad \frac{\partial \ell}{\partial \mathbf{b}^{(1)}}=\boldsymbol{\delta}, \quad \frac{\partial \ell}{\partial W^{(1)}}=\boldsymbol{\delta}\,\mathbf{x}^{\top}, $$ where $\odot$ is the elementwise product and $\mathbb{1}[\mathbf{z}^{(1)}>0]$ is the derivative of ReLU at $\mathbf{z}^{(1)}$. Each gradient is then used in the update $\Theta \leftarrow \Theta - \gamma \nabla_{\Theta}\ell$.
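The two iterations can be carried out numerically. Below is a minimal NumPy sketch of the forward pass, the backpropagation equations above, and the gradient-descent update for the given initialization and data (all parameter values are taken from the problem statement; note that the printed $W^{(1)}$ is transposed, so it is entered here as a $3\times 2$ matrix):

```python
import numpy as np

# Parameters as given in the problem (W1 is the transpose of the printed matrix).
W1 = np.array([[1., -1.], [0., -1.], [-1., 1.]])  # 3x2
W2 = np.array([[0., -2., 1.], [1., -1., -1.]])    # 2x3
b1 = np.array([0., 0., 1.])
b2 = np.array([1., 0.])

x = np.array([2., -3.])   # single training input
y = np.array([-4., 0.])   # its target
gamma = 1.0               # learning rate

for step in range(2):
    # Forward pass
    z1 = W1 @ x + b1               # hidden pre-activation
    h = np.maximum(z1, 0.0)        # ReLU
    yhat = W2 @ h + b2
    loss = 0.5 * float(np.sum((y - yhat) ** 2))

    # Backward pass (backpropagation / reverse-mode differentiation)
    e = yhat - y                   # d loss / d yhat
    dW2 = np.outer(e, h)
    db2 = e
    delta = (W2.T @ e) * (z1 > 0)  # chain rule through the ReLU
    dW1 = np.outer(delta, x)
    db1 = delta

    # Gradient-descent update with gamma = 1
    W1 -= gamma * dW1; b1 -= gamma * db1
    W2 -= gamma * dW2; b2 -= gamma * db2
    print(f"step {step}: loss = {loss}")
```

Working through the arithmetic by hand gives the same result: the first forward pass yields $\mathbf{h}=[5,3,0]^{\top}$, $\hat{\mathbf{y}}=[-5,2]^{\top}$, and loss $2.5$; after the first update the loss at the second forward pass is $305$, and the second update produces the final $\Theta$.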