Consider the differentiable functions $L^1(x,\theta^1),L^2(x^2,\theta^2),L^3(x^3,\theta^3)$, where each $x^k$ and $\theta^k$ is a real vector, for $k=1,2,3$. Also define $\theta=(\theta^1,\theta^2,\theta^3)$ (and note that $x$ is not $(x^1,x^2,x^3)$).
What is the Jacobian, with respect to $\theta$, of $f(x,\theta)=L^3(L^2(L^1(x,\theta^1),\theta^2),\theta^3)$?
This question arose while computing the gradients in the backpropagation phase of a neural network: I obtained a result that I don't think is correct (or at least it is not the one that backpropagation algorithms use).
Here is my attempt. Using the chain rule:
$Jf=JL^3 \cdot J(L^2(L^1(x,\theta^1),\theta^2),\theta^3)=JL^3 \begin{pmatrix} J_{x,\theta^1,\theta^2}L^2(L^1(x,\theta^1),\theta^2) & 0\\ 0 & I \end{pmatrix}=\left ( J_{x^3}L^3\cdot J_{x,\theta^1,\theta^2}L^2(L^1(x,\theta^1),\theta^2)\middle |J_{\theta^3}L^3\right )$
Hence $Jf=\left ( J_{x^3}L^3\cdot JL^2(L^1(x,\theta^1),\theta^2)\middle |J_{\theta^3}L^3\right )$, and by the above reasoning: $$Jf=\left ( J_{x^3}L^3\cdot \left (J_{x^2}L^2\cdot JL^1\middle | J_{\theta^2}L^2 \right )\middle |J_{\theta^3}L^3\right )=\left ( J_{x^3}L^3\cdot J_{x^2}L^2\cdot JL^1 \middle | J_{x^3}L^3 \cdot J_{\theta^2}L^2 \middle | J_{\theta^3}L^3\right )$$
So I conclude that $J_\theta f=\left ( J_{x^3}L^3\cdot J_{x^2}L^2\cdot J_{\theta^1}L^1 \middle | J_{x^3}L^3 \cdot J_{\theta^2}L^2 \middle | J_{\theta^3}L^3\right )$. Is this correct?
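As a quick numerical sanity check of this block formula, here is a sketch with toy elementwise layers $L^k(u,\theta^k)=\theta^k\odot u$ (so every Jacobian is diagonal and easy to write down); the layer choice and all names are made up purely for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
t1, t2, t3, x = (rng.standard_normal(n) for _ in range(4))

# Toy layers L^k(u, theta^k) = theta^k * u (elementwise), so
# J_x L^k = diag(theta^k) and J_theta L^k = diag(layer input).
def f(x, t1, t2, t3):
    return t3 * (t2 * (t1 * x))

x2 = t1 * x          # input to L^2
x3 = t2 * x2         # input to L^3

# The three blocks of the claimed formula.
B1 = np.diag(t3) @ np.diag(t2) @ np.diag(x)   # J_{x^3}L^3 . J_{x^2}L^2 . J_{theta^1}L^1
B2 = np.diag(t3) @ np.diag(x2)                # J_{x^3}L^3 . J_{theta^2}L^2
B3 = np.diag(x3)                              # J_{theta^3}L^3
J_formula = np.hstack([B1, B2, B3])

# Finite-difference Jacobian with respect to theta = (theta^1, theta^2, theta^3).
eps = 1e-6
theta = np.concatenate([t1, t2, t3])
J_fd = np.empty((n, 3 * n))
for j in range(3 * n):
    tp = theta.copy(); tp[j] += eps
    J_fd[:, j] = (f(x, tp[:n], tp[n:2*n], tp[2*n:]) - f(x, t1, t2, t3)) / eps

assert np.allclose(J_formula, J_fd, atol=1e-4)
```

For this toy choice the finite-difference Jacobian agrees with the block formula, which at least rules out an indexing mistake.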
Let me derive backpropagation for the 3-layer feedforward neural network above. For more general networks (e.g. residual ones) the idea is similar, and I'll give some hints for dealing with those at the end of this answer.
Suppose that we feed an input $x\in\mathbb{R}^d$ to the network to produce an output loss value $L\in\mathbb{R}$ (given by a differentiable loss function $\ell$): \begin{align} x_1 &= f_1(x, \theta_1),\\ x_2 &= f_2(x_1, \theta_2),\\ x_3 &= f_3(x_2, \theta_3),\\ L &= \ell(x_3). \end{align} The goal is to compute the gradient $\left(\frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \frac{\partial L}{\partial \theta_3}\right)$.
Some notes on the notation: I write $f_k$ in place of your $L^k$, and subscripts $x_k,\theta_k$ in place of your superscripts $x^k,\theta^k$.
Now let's backprop. The computation graph in this case is simple: $\require{AMScd}$ \begin{CD} x @>>> x_1 @>>> x_2 @>>> x_3 @>>> L\\ @. @AAA @AAA @AAA\\ @. \theta_1 @. \theta_2 @. \theta_3 \end{CD} We go backward from the last node and use the chain rule: \begin{align} \frac{\partial L}{\partial x_3} &= \ell'(x_3),\\ \frac{\partial L}{\partial \theta_3} &= \frac{\partial x_3}{\partial \theta_3}\frac{\partial L}{\partial x_3},\\ \frac{\partial L}{\partial x_2} &= \frac{\partial x_3}{\partial x_2}\frac{\partial L}{\partial x_3},\\ \frac{\partial L}{\partial \theta_2} &= \frac{\partial x_2}{\partial \theta_2}\frac{\partial L}{\partial x_2},\\ \frac{\partial L}{\partial x_1} &= \frac{\partial x_2}{\partial x_1}\frac{\partial L}{\partial x_2},\\ \frac{\partial L}{\partial \theta_1} &= \frac{\partial x_1}{\partial \theta_1}\frac{\partial L}{\partial x_1}. \end{align} That is exactly the order in which the derivative computations should be implemented. One can substitute the intermediate terms into the main terms (i.e. the ones with respect to $\theta$) to get a direct formula, for example \begin{equation}\label{Lt1} \frac{\partial L}{\partial \theta_1} = \frac{\partial x_1}{\partial \theta_1} \frac{\partial x_2}{\partial x_1} \frac{\partial x_3}{\partial x_2}\frac{\partial L}{\partial x_3}. \tag{1} \end{equation} However, this form should not be used in an implementation, as it would recompute the same quantities multiple times.
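The backward sweep above can be sketched in a few lines of NumPy. This is a minimal illustration, not the general case: I assume purely linear layers $f_k(u,\theta_k)=W_k u$ (so $\theta_k$ is just the matrix $W_k$), the loss $\ell(x_3)=\tfrac12\|x_3\|^2$, and made-up dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions of x, x1, x2, x3.
d, d1, d2, d3 = 4, 5, 3, 2

# Parameters: theta_k is a weight matrix W_k (a deliberate simplification).
W1 = rng.standard_normal((d1, d))
W2 = rng.standard_normal((d2, d1))
W3 = rng.standard_normal((d3, d2))
x = rng.standard_normal(d)

# Forward pass, caching the intermediate activations.
x1 = W1 @ x
x2 = W2 @ x1
x3 = W3 @ x2
L = 0.5 * x3 @ x3            # loss ell(x3) = ||x3||^2 / 2

# Backward pass, in exactly the order of the equations above.
dL_dx3 = x3                       # ell'(x3)
dL_dW3 = np.outer(dL_dx3, x2)     # dL/dtheta_3
dL_dx2 = W3.T @ dL_dx3            # product (dx3/dx2, dL/dx3)
dL_dW2 = np.outer(dL_dx2, x1)     # dL/dtheta_2
dL_dx1 = W2.T @ dL_dx2            # product (dx2/dx1, dL/dx2)
dL_dW1 = np.outer(dL_dx1, x)      # dL/dtheta_1

# Finite-difference check on one entry of W1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((W3 @ (W2 @ (W1p @ x))) ** 2)
assert abs((Lp - L) / eps - dL_dW1[0, 0]) < 1e-3
```

Each `dL_d*` line implements one equation of the backward sweep; for these linear layers the product $\frac{\partial x_{k}}{\partial x_{k-1}}\frac{\partial L}{\partial x_{k}}$ becomes a multiplication by $W_k^\top$, and each gradient is reused by the next line rather than recomputed.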
Another important note, again on the notation:
Because $x_3 = f_3(x_2, \theta_3)$, one can view $x_3$ as a function of $x_2$ and $\theta_3$, i.e. $x_3 := x_3(x_2, \theta_3)$, and thus $\frac{\partial x_3}{\partial \theta_3}$ denotes the partial derivative of that function with respect to $\theta_3$ (evaluated at $(x_2, \theta_3)$). Similarly for the other quantities. Using Jacobian notation, \eqref{Lt1} for example can be written as: \begin{align} \frac{\partial L}{\partial \theta_1} &= (J_{\theta_1} f_1) (x, \theta_1) \times (J_{x_1} f_2) (x_1, \theta_2) \times (J_{x_2} f_3) (x_2, \theta_3) \times (J \ell)(x_3). \end{align} If, with a slight abuse of notation, we omit the points at which the functions are evaluated, then the above becomes: \begin{align} \frac{\partial L}{\partial \theta_1} &= J_{\theta_1} f_1 \times (J_{x_1} f_2) \times (J_{x_2} f_3) \times (J \ell). \end{align} I hope this makes the comparison easier, since you seem to be familiar with Jacobian notation; you can now compare this result directly with yours.
A more rigorous presentation would use total-derivative notation instead of writing $\partial$ everywhere, as I have done above (and as many deep learning references do). For example, one should write: $\newcommand{\dv}[1]{\operatorname{d}\!{#1}}$ \begin{equation} \frac{\dv{L}}{\dv{\theta_1}} = \frac{\partial x_1}{\partial \theta_1}\frac{\dv{L}}{\dv{x_1}}. \end{equation}
For a more general computation graph, the principle is the same. One has to compute $\frac{\dv{L}}{\dv{x_i}}$ recursively for every node $x_i$ of the graph. The recursion lies in the fact that the derivative with respect to one node can be computed once the derivatives with respect to all its children have been computed, using the chain rule: \begin{equation} \frac{\dv{L}}{\dv{x_i}} = \sum_{j\in\mathrm{Children}(i)} \frac{\partial x_j}{\partial x_i} \frac{\dv{L}}{\dv{x_j}}. \end{equation}
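This recursion can be sketched with memoization on a toy graph where $x_1$ feeds into both $x_2$ and $x_3$ (the node names, edge Jacobians, and graph are all made up for the illustration):

```python
import numpy as np

# Toy DAG: x1 -> x2 -> L and x1 -> x3 -> L.
# children[i] lists the children of node i, and local_jac[(i, j)]
# holds the local Jacobian dx_j/dx_i (1x1 here, for scalar nodes).
children = {"x1": ["x2", "x3"], "x2": ["L"], "x3": ["L"]}
local_jac = {
    ("x1", "x2"): np.array([[2.0]]),
    ("x1", "x3"): np.array([[3.0]]),
    ("x2", "L"): np.array([[1.0]]),
    ("x3", "L"): np.array([[1.0]]),
}

grads = {"L": np.array([[1.0]])}   # base case: dL/dL = 1

def grad(i):
    """dL/dx_i = sum over children j of (dx_j/dx_i) dL/dx_j, memoized."""
    if i not in grads:
        grads[i] = sum(local_jac[(i, j)] @ grad(j) for j in children[i])
    return grads[i]

# x1 has two children, so its gradient sums both paths: 2*1 + 3*1 = 5.
print(grad("x1"))   # [[5.]]
```

The memoization in `grads` is what distinguishes backpropagation from naively expanding the chain rule: each $\frac{\dv{L}}{\dv{x_j}}$ is computed once and reused by every parent of $x_j$.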
Let me know if you have further questions.