I'm having trouble understanding the derivatives in the backpropagation algorithm. I'll use the example presented here.
If you're unfamiliar with the algorithm I'm talking about, that's okay; my question is only about the derivatives.
So I have the following functions:
$$ x_1 = W_1x_0$$ $$ x_2 = f_1(x_1)$$ $$E = \frac{1}{2} || x_2 - y||^2$$
where $x_0$ is a vector of size 4x1, $W_1$ is a matrix of size 5x4, and $f_1$ is some nonlinear function (for example, the logistic function) applied elementwise. $y$ is a vector with the same dimension as $x_2$.
Now, I need to take the derivative of E w.r.t. $W_1$. I'll use the chain rule:
$$ \frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial x_2} \frac{\partial x_2}{\partial x_1} \frac{\partial x_1}{\partial W_1}$$
I can understand the first derivative: the derivative of a scalar (that comes from the function E) w.r.t. a vector is a vector.
I'm not sure about the next part. The derivative of $x_2$ w.r.t. $x_1$ is the derivative of a vector w.r.t. a vector. Shouldn't that be a matrix (a Jacobian)?
And the part I understand least is the last one: the derivative of $x_1$ w.r.t. $W_1$. Isn't it impossible to take the derivative of a vector w.r.t. a matrix?
For clarity, drop the subscripts and give every variable a distinct name, then write down the definition and differential of each: $$\eqalign{ v &= Wx &\implies dv=dW\,x \cr s &= \sigma(v) &\implies ds = (s-s\circ s)\circ dv \cr S &= {\rm Diag}(s) &\implies ds = (S-S^2)\,dv \cr E &= \frac{1}{2}(s-y):(s-y) &\implies dE=(s-y):ds \cr }$$ where $(:)$ denotes the trace/Frobenius product, $A:B={\rm tr}(A^TB)$,
and $(\circ)$ denotes the elementwise/Hadamard product.
The logistic function $\sigma(v),\,$ was chosen as a concrete example of an activation function.
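The diagonal form of the Jacobian can be checked numerically. The sketch below (a hypothetical NumPy check, not part of the derivation) compares $S - S^2$ against a finite-difference Jacobian of the elementwise logistic function:

```python
import numpy as np

def sigma(v):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
v = rng.normal(size=5)
s = sigma(v)

# Analytic Jacobian: ds/dv = Diag(s - s*s) = S - S^2
S = np.diag(s)
J_analytic = S - S @ S

# Finite-difference Jacobian, built column by column
eps = 1e-6
J_numeric = np.empty((5, 5))
for j in range(5):
    e = np.zeros(5)
    e[j] = eps
    J_numeric[:, j] = (sigma(v + e) - sigma(v - e)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))  # tiny, e.g. ~1e-11
```

The off-diagonal entries of the numeric Jacobian vanish because each component of $s$ depends only on the corresponding component of $v$.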
Now it's just a matter of successively substituting differentials $$\eqalign{ dE &= (s-y):ds \cr &= (s-y):(S-S^2)\,dv \cr &= (S-S^2)(s-y):dW\,x \cr &= (S-S^2)(s-y)x^T:dW \cr \cr \frac{\partial E}{\partial W} &= (S-S^2)(sx^T-yx^T) \cr\cr }$$ The nice thing about the differential approach is that you don't need to deal with awkward higher-order tensors, such as the gradient of a vector with respect to a matrix. Whereas the differential of a vector (or matrix) behaves just like any other vector (or matrix).
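The final gradient can also be verified numerically. Here is a sketch (using NumPy, with the sizes from the question: $x$ of size 4, $W$ of size 5x4) that compares $(S-S^2)(s-y)x^T$ against a finite-difference gradient of $E$:

```python
import numpy as np

def sigma(v):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-v))

def loss(W, x, y):
    """E = (1/2) ||sigma(Wx) - y||^2"""
    s = sigma(W @ x)
    return 0.5 * np.sum((s - y) ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=4)        # x_0
W = rng.normal(size=(5, 4))   # W_1
y = rng.normal(size=5)

# Analytic gradient: (S - S^2)(s - y) x^T,
# computed elementwise since S - S^2 is diagonal
s = sigma(W @ x)
grad_analytic = np.outer((s - s * s) * (s - y), x)

# Finite-difference check, one matrix entry at a time
eps = 1e-6
grad_numeric = np.empty_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_numeric[i, j] = (loss(Wp, x, y) - loss(Wm, x, y)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, e.g. ~1e-10
```

Note that the gradient naturally comes out with the same 5x4 shape as $W$ itself, which is exactly why the differential approach never needs a higher-order tensor.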