I did this derivation and want to know whether it is correct. Essentially, we want the gradient of the composition $g \circ f$ at the point $M$.
Let $f: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ represent the part of the forward pass we are interested in, where $m$ is the batch size and $n$ the array length, and let $\Psi \equiv g \circ f$ be the function representing the full forward pass. Assume we can decompose $f$ into $d \in \mathbb{N}^+$ functions $f_1, f_2, \dots, f_d$, each mapping $\mathbb{R}^{m \times n}$ to itself: \begin{equation*} f \equiv f_d \circ f_{d-1} \circ \dots \circ f_1 \end{equation*}
Now define $f^i$ for $i \in \{0, 1, \dots, d\}$, where $f_0(M) := M$: \begin{equation*} f^i := f_i \circ f_{i - 1} \circ \dots \circ f_1 \circ f_0 \end{equation*} In particular, $f^d = f$.
Defining $f_0$ as the identity function is just a trick to simplify the expression below. Note that we already know $\nabla g(f(M))$, since that is one of the inputs to our backward pass, and because we performed the forward pass, we have also already computed all the values $f^i(M)$ for $i \in \{0, 1, \dots, d\}$.
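For concreteness, here is a minimal sketch (Python/NumPy) of how I picture the forward pass; the three example stages are made up, and any differentiable maps $\mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ would do:

```python
import numpy as np

# Hypothetical stages f_1, f_2, f_3 (so d = 3); purely illustrative.
stages = [np.tanh, lambda X: X**2, np.sin]

def forward(M, fs):
    """Apply f = f_d ∘ ... ∘ f_1 to M, caching every intermediate f^i(M)."""
    cache = [M]                 # f^0(M) = M (the identity trick)
    for f in fs:
        cache.append(f(cache[-1]))
    return cache                # cache[i] == f^i(M)
```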
Now let $M \in \mathbb{R}^{m \times n}$. It follows using the chain rule: \begin{equation*} \begin{aligned} \nabla (g \circ f)(M) & = \nabla g(f(M)) \cdot \nabla f^d(M) \\ & = \nabla g(f(M)) \cdot \nabla f_d(f^{d-1}(M)) \cdot \nabla f^{d-1}(M) \\ & = \nabla g(f(M)) \cdot \nabla f_d(f^{d-1}(M)) \cdot \nabla f_{d-1}(f^{d-2}(M)) \cdot \nabla f^{d-2}(M) \\ & \;\;\vdots \\ & = \nabla g(f(M)) \cdot \prod_{i=1}^{d-1} \left(\nabla f_{d-i+1}(f^{d-i}(M))\right) \cdot \nabla f^1(M) \\ &= \underbrace{\nabla g(f(M))}_{\text{known}} \cdot \prod_{i=1}^{d} \nabla f_{d-i+1}(\underbrace{f^{d-i}(M)}_{\text{known}}) \end{aligned} \end{equation*}
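Continuing the sketch, this is how I read my derivation in code: every factor $\nabla f_i(f^{i-1}(M))$ is built from a cached forward value alone, and only the accumulation loop walks in reverse order. The hand-written stage derivatives are again illustrative, and since the stages are elementwise their Jacobians are diagonal (a real autodiff framework would never materialize these dense matrices):

```python
import numpy as np

# Hand-written derivatives of the example stages above (also elementwise).
stage_derivs = [
    lambda X: 1.0 - np.tanh(X)**2,  # (tanh x)' = 1 - tanh^2(x)
    lambda X: 2.0 * X,              # (x^2)'    = 2x
    lambda X: np.cos(X),            # (sin x)'  = cos(x)
]

def local_jacobians(cache, dfs):
    """Build every factor ∇f_i(f^{i-1}(M)) from the cached forward values;
    elementwise stages give diagonal Jacobians on the flattened input."""
    return [np.diag(df(cache[i]).ravel()) for i, df in enumerate(dfs)]

def backward(grad_g, jacobians):
    """Accumulate ∇g(f(M)) · ∇f_d(f^{d-1}(M)) · ... · ∇f_1(M) as successive
    vector-Jacobian products (the usual reverse-order backward pass)."""
    v = grad_g.ravel()             # ∇g(f(M)), assuming g is scalar-valued
    for J in reversed(jacobians):  # J_d first, J_1 last
        v = v @ J
    return v.reshape(grad_g.shape)
```

By associativity of the matrix product, the result is the same however the factors are grouped; multiplying from the left merely keeps the running quantity a vector instead of a matrix (assuming $g$ is scalar-valued).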
My derivation implies that the order in which the factors $\nabla f_i(f^{i-1}(M))$ are computed is irrelevant. However, the standard technique for automatic differentiation, the backward pass (reverse-mode autodiff), as I understand it always uses the previously accumulated derivative as an input to the next one, taking the derivatives of the composed parts of the desired function in reverse order. Did I make a mistake, or does the order not matter? I'm not specifically looking to apply this to neural networks. Is the order perhaps only relevant for a neural network, where we want to update every layer $k$ and therefore need the partial product $\prod_{i=1}^k \nabla f_{d-i+1}(f^{d-i}(M))$?
More specifically, does this mean the backward pass can be parallelized to some degree?
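To make the question concrete, here is the kind of scheme I have in mind (a toy sketch, not how real frameworks work): since each factor needs only the cached $f^{i-1}(M)$, all factors could be formed concurrently, leaving only the product reduction sequential.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_local_jacobians(cache, dfs):
    """Form each ∇f_i(f^{i-1}(M)) concurrently; no factor depends on
    any other factor, only on a cached forward value."""
    with ThreadPoolExecutor() as pool:
        futs = [pool.submit(lambda i=i, df=df: np.diag(df(cache[i]).ravel()))
                for i, df in enumerate(dfs)]
        return [fut.result() for fut in futs]

def gradient(grad_g, jacobians):
    """Reduce J_d · J_{d-1} · ... · J_1, then apply ∇g(f(M)) on the left."""
    J = reduce(lambda acc, nxt: nxt @ acc, jacobians)  # later stages go left
    return (grad_g.ravel() @ J).reshape(grad_g.shape)
```

Of course, forming and multiplying $d$ full Jacobians costs far more memory and arithmetic than the sequential vector-Jacobian products, so perhaps that is the real reason the backward pass proceeds in order.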