Understanding the derivatives in backpropagation algorithm


I'm having trouble understanding the derivatives in the backpropagation algorithm. I'll use the example presented here.

If you're unfamiliar with the algorithm I'm talking about, that's okay; my question is only about the derivatives.

So I have the following functions:

$$ x_1 = W_1x_0$$ $$ x_2 = f_1(x_1)$$ $$E = \frac{1}{2} || x_2 - y||^2$$

where $x_0$ is a vector of size $4\times 1$, $W_1$ is a matrix of size $5\times 4$, and $f_1$ is some nonlinear function (for example, the logistic function). $y$ is a vector with the same dimension as $x_2$.
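To make the shapes concrete, here is a minimal sketch of this forward pass in NumPy, assuming the logistic function as $f_1$ and random values for $x_0$, $W_1$, and $y$ (all hypothetical; only the dimensions come from the question):

```python
import numpy as np

# Hypothetical concrete values matching the stated sizes:
# x0 is 4x1 and W1 is 5x4, so x1 and x2 are 5x1.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 1))
W1 = rng.standard_normal((5, 4))
y = rng.standard_normal((5, 1))

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

x1 = W1 @ x0                     # shape (5, 1)
x2 = logistic(x1)                # shape (5, 1)
E = 0.5 * np.sum((x2 - y) ** 2)  # scalar error
```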

Now, I need to take the derivative of E w.r.t. $W_1$. I'll use the chain rule:

$$ \frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial x_2} \frac{\partial x_2}{\partial x_1} \frac{\partial x_1}{\partial W_2}$$

I can understand the first derivative: the derivative of a scalar (that comes from the function E) w.r.t. a vector is a vector.

I'm not sure about the next part. The derivative of $x_2$ w.r.t. $x_1$ is the derivative of a vector w.r.t. a vector. Isn't that supposed to be a matrix, somehow?
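Indeed, the derivative of a vector with respect to a vector is a matrix (the Jacobian). For an elementwise nonlinearity such as the logistic function, that Jacobian is diagonal, which a quick finite-difference check can confirm (a sketch assuming the logistic function as $f_1$; the values of $v$ are arbitrary):

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

v = np.array([0.5, -1.0, 2.0])
s = logistic(v)

# Analytic Jacobian of the elementwise logistic: Diag(s * (1 - s))
J = np.diag(s * (1.0 - s))

# Central finite-difference Jacobian for comparison
eps = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J_fd[:, j] = (logistic(v + e) - logistic(v - e)) / (2 * eps)
```

Off-diagonal entries of `J_fd` are (numerically) zero, because each output component depends only on the matching input component.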

And the part I least understand is the last: The derivative of $x_1$ w.r.t. $W_1$. Isn't it impossible to take the derivative of a vector w.r.t. a matrix?

1 Answer
For clarity, rather than subscripts, give every variable a distinct name and write down the definition and differential for each $$\eqalign{ v &= Wx &\implies dv=dW\,x \cr s &= \sigma(v) &\implies ds = (s-s\circ s)\circ dv \cr S &= {\rm Diag}(s) &\implies ds = (S-S^2)\,dv \cr E &= \frac{1}{2}(s-y):(s-y) &\implies dE=(s-y):ds \cr }$$ where (:) denotes the trace/Frobenius product, $A:B={\rm tr}(A^TB)$
and $(\circ)$ denotes the elementwise/Hadamard product.
The logistic function $\sigma(v)$ was chosen as a concrete example of an activation function.
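The Frobenius product used above is just "multiply elementwise and sum", which is easy to verify numerically (a small check with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 4))

frob_trace = np.trace(A.T @ B)  # A:B defined as tr(A^T B)
frob_sum = np.sum(A * B)        # elementwise product, summed
```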

Now it's just a matter of successively substituting differentials $$\eqalign{ dE &= (s-y):ds \cr &= (s-y):(S-S^2)\,dv \cr &= (S-S^2)(s-y):dW\,x \cr &= (S-S^2)(s-y)x^T:dW \cr \cr \frac{\partial E}{\partial W} &= (S-S^2)(sx^T-yx^T) \cr\cr }$$ The nice thing about the differential approach is that you don't need to deal with awkward higher-order tensors, such as the gradient of a vector with respect to a matrix; the differential of a vector (or matrix) behaves just like any other vector (or matrix).
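The final gradient formula can be verified against a finite-difference approximation (a sketch with random inputs of the question's sizes, assuming the logistic activation):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 1))
W = rng.standard_normal((5, 4))
y = rng.standard_normal((5, 1))

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def energy(W):
    s = logistic(W @ x)
    return 0.5 * float(np.sum((s - y) ** 2))

# Analytic gradient from the derivation: (S - S^2)(s - y) x^T
s = logistic(W @ x)
S = np.diagflat(s)
grad = (S - S @ S) @ (s - y) @ x.T  # shape (5, 4), same as W

# Central finite-difference check, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_fd[i, j] = (energy(Wp) - energy(Wm)) / (2 * eps)
```

The analytic and numerical gradients agree to within finite-difference error, and the gradient has the same shape as $W$, as expected.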