Application of chain rule, and some recursion


Consider the differentiable functions $L^1(x^1,\theta^1),L^2(x^2,\theta^2),\dots,L^l(x^l,\theta^l)$, where each $x^k,\theta^k$ is a real vector, for $k=1,\dots,l$. Also define $\theta=(\theta^1,\dots,\theta^l)$.

Define the composite function $f(x,\theta)=x^{l+1}$ recursively by setting $x^k= L^{k-1}(x^{k-1},\theta^{k-1})$, starting from $x^1=x$.

Compute $J_\theta f$, the Jacobian of $f$ with respect to $\theta$.

For some context, I'm trying to implement gradient descent to optimize the loss function of a neural network, and if my computations are correct, I don't understand why we do back-propagation instead of, say, forward-propagation. Here is my attempt; is there any mistake?

  1. Compute $J f$: using the chain rule: $$ Jf=JL^l(x^l,\theta^l)= \left ( J_{x^l}L^l\cdot J_{x,\theta^1,...,\theta^{l-1}}x^l \middle| J_{\theta^l}L^l\right )= \left ( J_{x^l}L^l\cdot J_{x,\theta^1,...,\theta^{l-1}}L^{l-1} \middle| J_{\theta^l}L^l\right )$$ Hence we can write $Jf=J^l$, where $J^l$ is given by the following recursive rule: $$J^k=\left ( J_{x^k}L^k\cdot J^{k-1}\middle| J_{\theta^k}L^k\right ), \quad J^1=J_{x,\theta^1}L^1$$

  2. Obtain $J_\theta f$: we want to obtain the last columns of $Jf$, corresponding to the derivatives with respect to $\theta^1,...,\theta^l$. Clearly $$J_\theta f=\left ( J_{x^l}L^l\cdot J_{\theta^1,...,\theta^{l-1}}L^{l-1} \middle| J_{\theta^l}L^l\right )$$ Hence $J_\theta f=G^l$, where: $$G^k=\left ( J_{x^k}L^k\cdot G^{k-1}\middle| J_{\theta^k}L^k\right ), \quad G^1=J_{\theta^1}L^1$$
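In code, my recursion for $G^k$ would look something like this (a minimal sketch; `Jx_list[k-1]` stands for $J_{x^k}L^k$ and `Jtheta_list[k-1]` for $J_{\theta^k}L^k$, both assumed to be precomputed matrices at the current point — the names are mine):

```python
import numpy as np

def forward_mode_jacobian(Jx_list, Jtheta_list):
    """Accumulate G^k = ( J_{x^k}L^k @ G^{k-1} | J_{theta^k}L^k ) for k = 1..l.

    Jx_list[k-1]     -- Jacobian of layer k w.r.t. its input x^k
    Jtheta_list[k-1] -- Jacobian of layer k w.r.t. its parameters theta^k
    Returns G^l = J_theta f.
    """
    G = Jtheta_list[0]                   # G^1 = J_{theta^1} L^1
    # Note J_{x^1}L^1 is never used: we want J_theta f only, not J_x f.
    for Jx, Jtheta in zip(Jx_list[1:], Jtheta_list[1:]):
        G = np.hstack([Jx @ G, Jtheta])  # G^k from G^{k-1}
    return G
```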


Best answer:

It's straightforward to see that the gradient of the output with respect to all the parameters can be computed in a recursive, forward manner (as you have shown above). This procedure is called the forward-mode differentiation. The well-known backpropagation algorithm, on the other hand, is a special case of the reverse-mode differentiation, which is much harder to see (that's why its invention is appreciated).

The question is, if the forward-mode differentiation is straightforward, why do people keep using the reverse mode?

The answer lies in the computational efficiency of the reverse mode. Indeed, for a general computational graph, if the dimension of the input is much larger than the output's, then the reverse mode is much more efficient (and vice-versa). This is a well-known result in automatic differentiation (see e.g. Who Invented the Reverse Mode of Differentiation? by Griewank).

It turns out that, in machine learning, the so-called training task often involves the gradient of a scalar-valued objective function with respect to a large number of parameters, i.e. the dimension of the output (1d) is much smaller than the dimension of the parameter vector (as well as the dimension of the input features), and thus the reverse-mode differentiation is much more efficient in this case.

(Try deriving the backpropagation algorithm yourself, then you'll see that the computation of the gradient of the loss will involve a lot of matrix-vector multiplications, which are much less expensive than the many matrix-matrix multiplications in the forward mode. I believe that you are able to see this yourself, but let me know if you need additional help.)
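To make the cost gap concrete, here is a small sketch with made-up dimensions, counting the multiplications needed to push the chain rule through $l$ layers whose input Jacobians are all $n\times n$: reverse mode, for a scalar loss, drags a single row vector through the chain, while forward mode would drag a full $n\times n$ matrix:

```python
import numpy as np

n, l = 64, 8                                       # made-up layer width and depth
rng = np.random.default_rng(0)
As = [rng.normal(size=(n, n)) for _ in range(l)]   # stand-ins for J_{x^k}L^k

# Reverse mode: a scalar loss means we start from a (1 x n) row vector,
# so every step is a vector-matrix product (~n^2 multiplications per layer).
w = np.ones((1, n))
for A in reversed(As):
    w = w @ A

# Forward mode would instead propagate a full (n x n) matrix through each
# layer, i.e. matrix-matrix products (~n^3 multiplications per layer).
reverse_mults = l * n * n
forward_mults = l * n * n * n
```

With these dimensions, forward mode does $n=64$ times as many multiplications as reverse mode, and the gap grows with the layer width.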

Another answer:
  1. You wondered why backpropagation and not "forward-propagation". Khue gave a great answer, to which there is not much to add. As he said, automatic differentiation can be done in the forward mode or in the reverse mode. One way may require fewer arithmetic operations than the other, depending on the dimensions of the free parameters and the output. It is further explained in this answer.

    As for the terminology, backpropagation stands for "backward propagation of errors", which is the name for reverse-mode differentiation in the context of neural networks. Calling forward-mode differentiation "forward-propagation" would be a bit inappropriate, since the error is the function's output and can only be propagated from that end.

  2. Your derivations look correct to me. I am not sure whether you were simply asking for a verification or you were trying to derive the backpropagation in your own way, but got stuck. In the latter case, what you are missing is perhaps the right interpretation of your last line:

    $$G^k=\left ( J_{x^k}L^k\cdot G^{k-1}\middle| J_{\theta^k}L^k\right ), \quad G^1=J_{\theta^1}L^1.\tag{1}\label{eq1}$$

    This recursive relation indeed prompts us to start the computation with $k=1,2,\dots$, because $G^1$ is known and $G^k$ on the left-hand side depends on $G^{k-1}$ on the right-hand side; the computation is then straightforward.

    This does not mean, however, that we can't start from the other end, $k=l,l-1,\dots$. Recall that we are interested not in $G^k$, but in the $k$-th columns of $G^l$. The last ($l$th) column of $G^l$ is readily available, as it does not depend on $G^{l-1}$:

    $$G^l=\left ( J_{x^l}L^l\cdot G^{l-1}\middle| J_{\theta^l}L^l\right ).$$

    For $k=l-1$ we need to take the second-to-last column. It does depend on $G^{l-1}$, but to be precise, it depends on the last column of $G^{l-1}$, which, in turn, does not depend on $G^{l-2}$. So we can pull it out, as follows:

    $$G^{l}=\left(J_{x^{l}}L^{l}\cdot J_{x^{l-1}}L^{l-1}\cdot G^{l-2}|J_{x^{l}}L^{l}\cdot J_{\theta^{l-1}}L^{l-1}|J_{\theta^{l}}L^{l}\right),$$ which becomes $$G^{l}=\left(J_{x^{l-1}}L^{l}\cdot G^{l-2}|J_{\theta^{l-1}}L^{l}|J_{\theta^{l}}L^{l}\right).$$

    At this point, it should be clear how to continue.

Update. In the above transition, the second-to-last column was computed as $J_{\theta^{l-1}}L^{l}=J_{x^{l}}L^{l}\cdot J_{\theta^{l-1}}L^{l-1}$. By analogy, we will observe that the consequent columns (moving from last to first) are computed as $$J_{\theta^{k-1}}L^{l}=J_{x^{k}}L^{l}\cdot J_{\theta^{k-1}}L^{k-1},\tag{2a}\label{eq3}$$

where $J_{x^{k}}L^{l}$ can be obtained through $$J_{x^{k}}L^{l}=J_{x^{k+1}}L^{l}\cdot J_{x^{k}}L^{k}.\tag{2b}\label{eq4}$$

The left-hand sides of \eqref{eq3}, \eqref{eq4} have $k-1$ and $k$, while the right-hand sides have $k$, $k+1$, and the terms we can know directly. So now you can use relations \eqref{eq3}, \eqref{eq4} recursively starting from $k=l,l-1,\dots$. This corresponds to the reverse-mode AD.

Of course, you could obtain \eqref{eq3}, \eqref{eq4} directly, without relying on your previous computations with $G^k$. I just wanted to show that where you stopped was not a dead end. If you were to start over, you would proceed as follows:

Compute $J_{\theta^{1}\dots\theta^{l}}f=\left(J_{\theta^{1}}f\mid\dots\mid J_{\theta^{l}}f\right)$

where you'd carefully apply the chain rule for full derivatives in each column and would notice that the columns have common sub-expressions. I suppose, instead of going column by column you could formulate the same in a matrix form, like you did in \eqref{eq1}, but I don't see a point in such an exercise.

Another answer:

So, as far as I can understand, reverse-mode differentiation is the following. After initializing $D=I$:

for $k$ from $l$ to $1$:

  1. Save $D\cdot J_{\theta^{k}}L^{k}$ as $J_{\theta^{k}}f$
  2. $D=D\cdot J_{x^{k}}L^{k}$
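The two-step loop above can be sketched in code; this is a minimal sketch, assuming each layer hands us its local Jacobians $J_{x^k}L^k$ and $J_{\theta^k}L^k$ as plain matrices (the names `Jx_list` and `Jtheta_list` are mine):

```python
import numpy as np

def reverse_mode_blocks(Jx_list, Jtheta_list):
    """Run the loop above: for k = l..1, save D @ J_{theta^k}L^k as
    J_{theta^k}f, then update D = D @ J_{x^k}L^k.

    Jx_list[k-1], Jtheta_list[k-1] -- local Jacobians of layer k.
    Returns the blocks [J_{theta^1}f, ..., J_{theta^l}f].
    """
    D = np.eye(Jx_list[-1].shape[0])     # initialize D = I (output dimension)
    blocks = [None] * len(Jx_list)
    for k in reversed(range(len(Jx_list))):
        blocks[k] = D @ Jtheta_list[k]   # step 1: save J_{theta^k} f
        D = D @ Jx_list[k]               # step 2: extend D one layer back
    return blocks
```

Stacking the returned blocks side by side reproduces $J_\theta f=G^l$ from the question, which gives one way to check the loop against the forward recursion.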

Is this the algorithm that is implemented in the backward pass of every layer?