Calculate derivative in the context of backpropagation


I have received the following problem:

Consider the following simple model of a neuron:

z = wx + b   (logits),

ŷ = g(z)   (activation),

L₂(w, b) = ½ (y − ŷ)²   quadratic loss (Mean Squared Error (MSE), L2 loss, ℓ2-norm),

L₁(w, b) = |y − ŷ|   absolute value loss (Mean Absolute Error (MAE), L1 loss, ℓ1-norm),

with x,w,b ∈ R. Calculate the derivatives ∂L/∂w and ∂L/∂b for updating the weight w and bias b. Determine the results for both loss functions (L1, L2) and assume a sigmoid and a tanh activation function g(z). Write down all steps of your derivation. Hint: You have to use the chain rule.
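To make the setup concrete, here is a minimal sketch of the neuron and both loss functions in Python (the function names `forward`, `l2_loss`, and `l1_loss` are my own choices, not part of the problem statement; the sigmoid is used as the example activation):

```python
import math

def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward pass of the single neuron: logits, then activation."""
    z = w * x + b          # logits
    y_hat = sigmoid(z)     # activation (sigmoid chosen here)
    return z, y_hat

def l2_loss(y, y_hat):
    """Quadratic loss: 1/2 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

def l1_loss(y, y_hat):
    """Absolute value loss: |y - y_hat|."""
    return abs(y - y_hat)
```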

I have considered the following approach, for example for L2 differentiated with respect to w, assuming the sigmoid activation: \begin{equation} L_2 = \frac{1}{2} (y - \hat{y})^2 = \frac{1}{2} \left( y - g(z) \right)^2 = \frac{1}{2} \left( y - g(wx + b) \right)^2 = \frac{1}{2} \left( y - \frac{1}{1 + e^{-(wx + b)}} \right)^2 \end{equation}

This fully substituted expression can then be differentiated directly using the chain rule.
However, in the literature I find the following approach instead:
\begin{equation} \frac{\partial L_2}{\partial w} = \frac{\partial L_2}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w} = -(y - \hat{y}) \times g'(z) \times x \end{equation}
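One way to convince yourself that the two approaches yield the same result is a quick numerical check: differentiate the fully substituted expression by finite differences and compare it with the factored chain-rule formula. This is only an illustrative sketch (the sample values of w, b, x, y are arbitrary), using the known sigmoid identity g'(z) = g(z)(1 − g(z)):

```python
import math

def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def l2_grad_w(w, b, x, y):
    """Chain-rule gradient: dL2/dw = -(y - y_hat) * g'(z) * x,
    with g'(z) = g(z) * (1 - g(z)) for the sigmoid."""
    z = w * x + b
    y_hat = sigmoid(z)
    return -(y - y_hat) * y_hat * (1.0 - y_hat) * x

def l2_loss(w, b, x, y):
    """L2 loss with y_hat fully substituted: 1/2 * (y - g(wx + b))^2."""
    return 0.5 * (y - sigmoid(w * x + b)) ** 2

# Central finite difference of the substituted expression
w, b, x, y, eps = 0.3, -0.1, 0.7, 1.0, 1e-6
numeric = (l2_loss(w + eps, b, x, y) - l2_loss(w - eps, b, x, y)) / (2 * eps)
analytic = l2_grad_w(w, b, x, y)
# numeric and analytic agree up to floating-point error:
# both compute the same derivative, just written in different forms
```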

So which of the two approaches is the right one in this context? Or is there a deeper connection between them?