Gradient Descent via Weight Parameter


Let $f_{\Theta}: \mathbb{R}^{2} \rightarrow \mathbb{R}^{2}$ be a neural network such that $$ f_{\Theta}(\mathbf{x})=W^{(2)} \sigma\left(W^{(1)} \mathbf{x}+\mathbf{b}^{(1)}\right)+\mathbf{b}^{(2)} $$ where $$ W^{(1)} \in \mathbb{R}^{3 \times 2}, W^{(2)} \in \mathbb{R}^{2 \times 3}, \mathbf{b}^{(1)} \in \mathbb{R}^{3}, \mathbf{b}^{(2)} \in \mathbb{R}^{2} $$ and $\sigma$ is the ReLU function. Suppose the parameter $\Theta=\left\{W^{(1)}, W^{(2)}, \mathbf{b}^{(1)}, \mathbf{b}^{(2)}\right\}$ is initialized as $$ W^{(1)}=\left[\begin{array}{ccc} 1 & 0 & -1 \\ -1 & -1 & 1 \end{array}\right]^{\top}, W^{(2)}=\left[\begin{array}{ccc} 0 & -2 & 1 \\ 1 & -1 & -1 \end{array}\right], \mathbf{b}^{(1)}=\left[\begin{array}{lll} 0 & 0 & 1 \end{array}\right]^{\top}, \mathbf{b}^{(2)}=\left[\begin{array}{ll} 1 & 0 \end{array}\right]^{\top} . $$ To minimize the $L^{2}$ loss $\ell(\Theta)=\frac{1}{2}\left\|\mathbf{y}-f_{\Theta}(\mathbf{x})\right\|^{2}$, we use gradient descent with learning rate $\gamma=1$. How can one evaluate $\Theta$ after two iterations of the optimization, given the training data $$ \mathcal{D}=\left\{\left(\left[\begin{array}{ll} 2 & -3 \end{array}\right]^{\top},\left[\begin{array}{ll} -4 & 0 \end{array}\right]^{\top}\right)\right\}? $$

At the very least, if the gradients could be evaluated, the problem would be easy to manage. But what is the main principle for calculating the gradients with respect to the weight matrices?

There is 1 answer below.

The algorithm for computing these gradients is called backpropagation. It is an instance of reverse-mode automatic differentiation: you compute $\frac{\partial \ell}{\partial W^{(k)}}$ layer by layer, in reverse order, by repeatedly applying the chain rule.
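Concretely, write the forward pass as $\mathbf{z}=W^{(1)}\mathbf{x}+\mathbf{b}^{(1)}$, $\mathbf{a}=\sigma(\mathbf{z})$, $\hat{\mathbf{y}}=W^{(2)}\mathbf{a}+\mathbf{b}^{(2)}$. The chain rule then gives $$ \frac{\partial \ell}{\partial \mathbf{b}^{(2)}}=\hat{\mathbf{y}}-\mathbf{y}, \quad \frac{\partial \ell}{\partial W^{(2)}}=(\hat{\mathbf{y}}-\mathbf{y})\,\mathbf{a}^{\top}, \quad \frac{\partial \ell}{\partial \mathbf{b}^{(1)}}=\left(W^{(2)\top}(\hat{\mathbf{y}}-\mathbf{y})\right)\odot\sigma'(\mathbf{z}), \quad \frac{\partial \ell}{\partial W^{(1)}}=\frac{\partial \ell}{\partial \mathbf{b}^{(1)}}\,\mathbf{x}^{\top}, $$ where $\sigma'(z)=\mathbb{1}[z>0]$ entrywise. Here is a minimal sketch of the two requested iterations, assuming NumPy (the variable names are mine, not from the problem statement):

```python
import numpy as np

# Initialization from the problem statement (W1 is the transpose as given)
W1 = np.array([[1., -1.], [0., -1.], [-1., 1.]])   # (3, 2)
W2 = np.array([[0., -2., 1.], [1., -1., -1.]])     # (2, 3)
b1 = np.array([0., 0., 1.])
b2 = np.array([1., 0.])

x = np.array([2., -3.])   # training input
y = np.array([-4., 0.])   # training target
gamma = 1.0               # learning rate

for it in range(2):
    # Forward pass
    z = W1 @ x + b1            # hidden pre-activation
    a = np.maximum(z, 0.0)     # ReLU
    y_hat = W2 @ a + b2
    loss = 0.5 * np.sum((y - y_hat) ** 2)

    # Backward pass (chain rule, applied in reverse order)
    d_yhat = y_hat - y             # dL/d y_hat
    dW2 = np.outer(d_yhat, a)      # dL/dW2
    db2 = d_yhat                   # dL/db2
    da = W2.T @ d_yhat             # propagate back through W2
    dz = da * (z > 0)              # ReLU derivative (subgradient 0 at 0)
    dW1 = np.outer(dz, x)          # dL/dW1
    db1 = dz                       # dL/db1

    # Gradient-descent update: Theta <- Theta - gamma * grad
    W1 -= gamma * dW1; b1 -= gamma * db1
    W2 -= gamma * dW2; b2 -= gamma * db2
    print(f"iteration {it + 1}: loss before update = {loss}")
```

Working the numbers through by hand gives $\hat{\mathbf{y}}=[-5, 2]^{\top}$ and loss $5/2$ on the first pass, with e.g. $\mathbf{b}^{(2)}$ updating to $[2, -2]^{\top}$; on the second pass the loss jumps to $305$. So the exercise also illustrates that $\gamma=1$ is too large here for the iteration to converge.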