Gradient Descent Pen & Paper Example


As a pen-and-paper exercise to better understand gradient descent and backpropagation in neural networks, I want to compute the gradient of a loss function with respect to the parameters $a$, $b$ and $c$ in the following equation, in order to perform gradient descent on the loss:

Given are the following functions:

$$f(x,a)$$ $$g(y,b)$$ $$h(z,c)$$

$$\nabla Loss = \nabla \sum_{n=1}^N(h(g(f(x_n,a),b),c) -y(x_n))^2 $$

so...

$$\hat{y}(a,b,c,x_n) = h(g(f(x_n,a),b),c)$$

then

$$\nabla Loss = \nabla \sum_{n=1}^N(\hat{y}(a,b,c,x_n) -y(x_n))^2 $$

then if I want to calculate the partial derivatives of the loss functions I get stuck...

$$\frac{\partial}{\partial a} Loss = \sum_{n=1}^N\frac{\partial}{\partial a} (h(g(f(x_n,a),b),c) - y(x_n))^2$$

$$\frac{\partial}{\partial b} Loss = ...$$

$$\frac{\partial}{\partial c} Loss = ...$$

Afterwards I am supposed to assume that $g(y,b) = y^2 + b\,y$.

How can I calculate the partial derivatives of this chained equation $\hat{y}$?

EDIT: So what needs to be solved is the following system of equations, where only the function $g$ and its derivatives are known explicitly; the others remain unknown:

$$ \frac{\partial Loss}{\partial a} = \frac{\partial Loss}{\partial h} \frac{\partial h}{\partial g} \frac{\partial g}{\partial f} \frac{\partial f}{\partial a}$$

$$ \frac{\partial Loss}{\partial b} = \frac{\partial Loss}{\partial h} \frac{\partial h}{\partial g} \frac{\partial g}{\partial b}$$

$$ \frac{\partial Loss}{\partial c} = \frac{\partial Loss}{\partial h} \frac{\partial h}{\partial c}$$
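As a concrete sanity check, this system of equations can be evaluated numerically. Below is a minimal sketch: only $g(y,b) = y^2 + b\,y$ is given in the exercise, so the choices $f(x,a) = a\,x$ and $h(z,c) = z + c$ are illustrative assumptions of mine, picked just to make the chain rule mechanics visible.

```python
# Chain-rule gradients for Loss = sum_n (h(g(f(x_n, a), b), c) - y_n)^2.
# Assumed toy choices (NOT from the exercise): f(x, a) = a*x, h(z, c) = z + c.
# Given in the exercise: g(y, b) = y^2 + b*y.

def grads(a, b, c, xs, ys):
    dLa = dLb = dLc = 0.0
    for x, y in zip(xs, ys):
        # forward pass
        u = a * x             # u = f(x, a)
        v = u**2 + b * u      # v = g(u, b)
        yhat = v + c          # yhat = h(v, c)
        # backward pass: each factor below is one term of the chain rule
        dL_dyhat = 2.0 * (yhat - y)   # dLoss/dh
        dh_dv = 1.0                   # dh/dg
        dh_dc = 1.0                   # dh/dc
        dv_du = 2.0 * u + b           # dg/df
        dv_db = u                     # dg/db
        du_da = x                     # df/da
        dLa += dL_dyhat * dh_dv * dv_du * du_da
        dLb += dL_dyhat * dh_dv * dv_db
        dLc += dL_dyhat * dh_dc
    return dLa, dLb, dLc
```

The three accumulation lines are a one-to-one transcription of the three equations above; a finite-difference check on the loss confirms the products are multiplied in the right order.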

I hope the way I explained the problem makes sense to you and I'm very grateful for all kinds of help.

Cheers


There is 1 answer below.


Writing out the chain rule by hand to express the gradients of a network gets exceptionally messy very quickly.

Fundamentally, the gradient at every node in a network is just the sum of all incoming gradients, calculated backward from the loss function. You can imagine that a parameter with 15 incoming gradients is very time-consuming to express by writing out the chain rule for every incoming path.

As such, I would suggest approaching this task visually instead. Doing so makes it much simpler; take the following example:

[Image: a computational-graph node computing $f(a,b) = a \cdot b$, with three incoming gradients $-0.2$, $0.4$ and $0.6$ flowing into its output]

This image shows how the gradient is calculated through a multiply node (so just $f(a,b) = a \cdot b$), with incoming gradients $-0.2$, $0.4$ and $0.6$. Expressed via the chain rule, these incoming gradients are first summed. Then you take the derivative of $f$ with respect to $a$, which is $b$, and multiply it by the summed gradient at $f$; the same goes for $b$.
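The picture above can be sketched in a few lines. The incoming gradient values are taken from the image; the input values and the function name are my own placeholders:

```python
# Backward pass through a single multiply node f(a, b) = a * b.
# Several downstream paths feed gradients into the node's output;
# they are summed, then routed to each input via the local
# derivatives df/da = b and df/db = a.

def multiply_backward(a, b, incoming_grads):
    grad_out = sum(incoming_grads)  # total gradient arriving at f's output
    grad_a = grad_out * b           # chain rule: grad_out * df/da
    grad_b = grad_out * a           # chain rule: grad_out * df/db
    return grad_a, grad_b

# Incoming gradients from the example image; a = 3, b = 2 are made up:
ga, gb = multiply_backward(3.0, 2.0, [-0.2, 0.4, 0.6])
# grad_out = 0.8, so ga ≈ 0.8 * 2 = 1.6 and gb ≈ 0.8 * 3 = 2.4
```

Note that the sum happens once at the node's output, so each input reuses it; this is exactly the saving over expanding the chain rule separately per incoming path.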

This is how gradients are actually calculated in a network. I think what you're asking for in your question is to expand the chain rule into an exact symbolic gradient function for each parameter. This is rarely done in practice because it's incredibly expensive to compute. Instead, the gradient for just one particular set of input values is calculated and "propagated backwards" through the network.

If you still want some more information on this, I would suggest watching this. It really helped my understanding.