Consider the differentiable functions $L^1(x^1,\theta^1),L^2(x^2,\theta^2),...,L^l(x^l,\theta^l)$, where each $x^k$ and $\theta^k$ is a real vector, for $k=1,...,l$. Also define $\theta=(\theta^1,...,\theta^l)$.
Define the composite function $f(x,\theta)=x^{l+1}$ recursively via $x^k= L^{k-1}(x^{k-1},\theta^{k-1})$ for $k=2,\dots,l+1$, with $x^1=x$.
Compute $J_\theta f$, the Jacobian of $f$ with respect to $\theta$.
For some context: I'm trying to implement gradient descent to optimize the loss function of a neural network, and if my computations are correct, I don't understand why we do back-propagation instead of, say, forward-propagation. Here is my attempt; is there any mistake?
Compute $Jf$ using the chain rule: $$ Jf=JL^l(x^l,\theta^l)= \left ( J_{x^l}L^l\cdot J_{x,\theta^1,...,\theta^{l-1}}x^l \middle| J_{\theta^l}L^l\right )= \left ( J_{x^l}L^l\cdot J_{x,\theta^1,...,\theta^{l-1}}L^{l-1} \middle| J_{\theta^l}L^l\right )$$ Hence we can write $Jf=J^l$, where $J^l$ is given by the following recursive rule: $$J^k=\left ( J_{x^k}L^k\cdot J^{k-1}\middle| J_{\theta^k}L^k\right ), \quad k=2,\dots,l, \qquad J^1=J_{x,\theta^1}L^1$$
Obtain $J_\theta f$: we want the last columns of $Jf$, corresponding to the derivatives with respect to $\theta^1,...,\theta^l$. Clearly $$J_\theta f=\left ( J_{x^l}L^l\cdot J_{\theta^1,...,\theta^{l-1}}L^{l-1} \middle| J_{\theta^l}L^l\right )$$ Hence $J_\theta f=G^l$, where: $$G^k=\left ( J_{x^k}L^k\cdot G^{k-1}\middle| J_{\theta^k}L^k\right ), \quad k=2,\dots,l, \qquad G^1=J_{\theta^1}L^1$$
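The recursion for $G^k$ can be checked numerically. Below is a minimal sketch, assuming layers of the illustrative form $L^k(x^k,\theta^k)=\tanh(Wx^k+b)$ with $\theta^k=(\operatorname{vec}W, b)$; the function names (`layer_jacobians`, `forward_jacobian`) are made up for this example.

```python
import numpy as np

def layer(x, W, b):
    return np.tanh(W @ x + b)

def layer_jacobians(x, W, b):
    """Jacobians of L(x, theta) = tanh(W x + b) w.r.t. x and theta = (vec W, b)."""
    s = 1.0 - np.tanh(W @ x + b) ** 2       # tanh', one entry per output
    m, n = W.shape
    Jx = s[:, None] * W                     # J_x L, shape (m, n)
    JW = np.zeros((m, m * n))               # d out_i / d W_{ij} = s_i * x_j
    for i in range(m):
        JW[i, i * n:(i + 1) * n] = s[i] * x
    Jb = np.diag(s)                         # d out_i / d b_i = s_i
    return Jx, np.hstack([JW, Jb])          # J_theta L, shape (m, m*n + m)

def forward_jacobian(x, params):
    """G^k = ( J_{x^k}L^k . G^{k-1} | J_{theta^k}L^k ),  G^1 = J_{theta^1}L^1."""
    G = None
    for W, b in params:
        Jx, Jth = layer_jacobians(x, W, b)
        G = Jth if G is None else np.hstack([Jx @ G, Jth])
        x = layer(x, W, b)
    return G                                # = J_theta f

rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((2, 3)), rng.standard_normal(2))]
x = rng.standard_normal(2)
G = forward_jacobian(x, params)             # shape (2, 9 + 8)
```

Note that the accumulated block $G^{k-1}$ keeps widening as more $\theta^k$ columns are appended, which is exactly where the forward mode pays its price.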
It is straightforward to see that the derivatives of the output with respect to all the parameters can be computed in a recursive, forward manner (as you have shown above). This procedure is called forward-mode differentiation. The well-known backpropagation algorithm, on the other hand, is a special case of reverse-mode differentiation, which is much less obvious (which is why its invention is so appreciated).
The question is: if forward-mode differentiation is so straightforward, why do people use the reverse mode instead?
The answer lies in the computational efficiency of the reverse mode. Indeed, for a general computational graph, if the dimension of the input is much larger than that of the output, then the reverse mode is much more efficient (and vice versa). This is a well-known result in automatic differentiation (see e.g. *Who Invented the Reverse Mode of Differentiation?* by Griewank).
It turns out that, in machine learning, the training task typically involves the gradient of a scalar-valued objective function with respect to a very large number of parameters, i.e. the dimension of the output (one) is much smaller than the dimension of the parameter vector (and of the input features), so reverse-mode differentiation is much more efficient in this case.
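To make the cost gap concrete, here is a rough multiplication count under simplifying assumptions: every layer maps $\mathbb{R}^n\to\mathbb{R}^n$ with $n^2+n$ parameters, the loss is scalar, and only the accumulation products (forward: $J_{x^k}L^k\cdot G^{k-1}$; reverse: the adjoint row vector times $J_{x^k}L^k$, plus the parameter-gradient contributions) are counted. Both function names are invented for illustration.

```python
def forward_mode_cost(n, layers):
    """Multiplications to accumulate G^k left-to-right (scalar loss at the end)."""
    total_params, cost = 0, 0
    for _ in range(layers):
        cost += n * n * total_params    # (n x n) Jacobian times accumulated (n x P) block
        total_params += n * n + n       # this layer's theta^k widens G^k
    return cost

def reverse_mode_cost(n, layers):
    """Multiplications to propagate the adjoint row vector right-to-left."""
    cost = 0
    for _ in range(layers):
        cost += n * n                   # row vector times (n x n) Jacobian
        cost += n * n + n               # this layer's parameter gradient
    return cost

# forward_mode_cost grows like l^2 * n^4, reverse_mode_cost like l * n^2
```

For $n=100$ and $l=10$ layers, this crude count already puts the forward mode more than four orders of magnitude behind the reverse mode.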
(Try deriving the backpropagation algorithm yourself; you'll see that computing the gradient of the loss involves a long sequence of matrix-vector multiplications, which are much less expensive than the matrix-matrix multiplications required in the forward mode. I believe you can see this yourself, but let me know if you need additional help.)
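For concreteness, here is a minimal backpropagation sketch, assuming layers of the illustrative form $L^k(x^k,\theta^k)=\tanh(Wx^k+b)$ and the (arbitrarily chosen) scalar loss $\ell(x^{l+1})=\tfrac12\|x^{l+1}\|^2$; the point to notice is that every backward step is a vector-matrix product, never a matrix-matrix product.

```python
import numpy as np

def layer(x, W, b):
    return np.tanh(W @ x + b)

def backprop(x, params):
    """Gradient of ell(x^{l+1}) = 0.5 ||x^{l+1}||^2 w.r.t. every (W, b)."""
    xs = [x]                                 # cache x^1, ..., x^{l+1}
    for W, b in params:
        xs.append(layer(xs[-1], W, b))
    u = xs[-1]                               # adjoint d ell / d x^{l+1}
    grads = []
    for (W, b), xk in zip(reversed(params), reversed(xs[:-1])):
        s = 1.0 - np.tanh(W @ xk + b) ** 2   # tanh'
        v = u * s                            # adjoint at the pre-activation
        grads.append((np.outer(v, xk), v))   # d ell / d W,  d ell / d b
        u = W.T @ v                          # vector-matrix: u^T J_{x^k}L^k
    return list(reversed(grads))

rng = np.random.default_rng(1)
params = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((2, 3)), rng.standard_normal(2))]
x = rng.standard_normal(2)
grads = backprop(x, params)
```

The adjoint $u$ always stays the size of one layer's output, so no growing Jacobian block is ever carried along, which is exactly the efficiency argument above.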