Let $f: \mathbb{R}^m \rightarrow \mathbb{R^n}$ and $g : \mathbb{R^n} \rightarrow \mathbb{R^m}$ be two vector-valued functions. We want to show that
$$J_{f\circ g}(a)=J_f(g(a))Jg(a)$$
where $J$ is the Jacobian and $a$ is a point in $\mathbb{R^n}$.
Wikipedia Chain Rule page, at Higher dimensions section, has the following:
Let $D_a(g)$ denote the total derivative of $g$ at $a$ and $D_{g(a)}(f)$ denote the total derivative of $f$ at $g(a)$. These two derivatives are linear transformations. The chain rule for total derivatives says that their composite is the total derivative of $f \circ g$ at $a$, that is:
$$ D_a(f \circ g) = D_{g(a)}(f) \circ D_a(g) \tag{1} $$
Concluding by the fact that the total derivative is Jacobian and since it is linear transformation, the composite of two total derivative becomes product of them.
My questions:
1- Why $(1)$ is true? prove this?
2- Alternative prove that start from scratch and shows $J_{f\circ g}(a)=J_f(g(a))Jg(a)$.
One idea is as follows. If $f$ and $g$ are differentiable, then by definition it exists a best linear approximation which is:
$$ g(x+h)=g(x)+dg[x](h)+o(\|h\|) $$
where $dg[x](h)$ denotes the action of the linear application $dg[x]$ on the vector $h$.
The notation $o(\|h\|)$ means that $$\lim\limits_{h\to 0} \frac{o(\|h\|)}{\|h\|}=\mathbf{0}_{\mathbb{R}^m}$$ Its interpretation is that for any constant $C$, $o(\|h\|)$ tends faster than $C.h$ to the zero vector, hence for small $h$ our linear approximation $dg[x].h$ is not "perturbed" by this reminder.
We can write the same thing for $f$: $$ f(y+k)=f(y)+df[y].k+o(\|k\|) $$
Now observe that:
\begin{align} f(g(x+h)) &=f(\overbrace{g(x)}^y+\overbrace{dg[x].h+o(\|h\|)}^k)) \\ &=f(g(x))+df[g(x)](dg[x].h+o(\|h\|))+o(\|dg(x).h+o(\|h\|)\|) \end{align}
Now from the definition of $o(.)$, it is not too hard to see that: $$ o(\|dg(x).h+o(\|h\|)\|)=o(\|h\|) $$ and (by linearity) = \begin{align} df[g(x)](dg[x].h+o(\|h\|)) &=df[g(x)](dg[x](h))+df[g(x)](o(\|h\|)) \\ &=df[g(x)](dg[x](h))+o(\|h\|) \end{align}
We finally get $$ f(g(x+h))=f(g(x))+df[g(x)](dg[x](h))+o(\|h\|) $$ and by identification $$d(f\circ g)[x]=df[g(x)]dg[x]$$ which is the expected result