Good evening. I'm self-learning Analysis and often encounter the chain rule stated as follows:
Let $f: \mathbb R^k \to \mathbb R^m$ be differentiable at $a\in \mathbb R^k$ and $g: \mathbb R^m \to \mathbb R^n$ be differentiable at $f(a) \in \mathbb R^m$. Then $g \circ f$ is differentiable at $a$ and $\partial (g \circ f)(a) = \partial g (f(a)) \partial f(a)$.
Because I'm going to learn multilinear maps (by myself) as a preparation for the definitions of higher derivatives and higher-order partial derivatives, I would like to make sure that I correctly understand the meaning behind the formula of this important theorem.
I have written all of my thoughts below, and I'm sorry that it's quite long. Could you please have a look and verify it? Thank you so much for your help!
As a usual convention, a vector is considered as a matrix with one column.
Here $\partial f(a)$ is the derivative of $f$ at $a$ and thus $\partial f(a) \in \mathcal L(\mathbb R^k,\mathbb R^m)$. Similarly, $\partial g (f(a))$ is the derivative of $g$ at $f(a)$ and thus $\partial g (f(a)) \in \mathcal L(\mathbb R^m,\mathbb R^n)$. Note that $\partial g (f(a)) \partial f(a)$ is not the arithmetic product of $\partial g (f(a))$ and $\partial f(a)$. Instead, $\partial g (f(a)) \partial f(a)$ is the function composition of $\partial g (f(a))$ and $\partial f(a)$, i.e. $\partial (g \circ f)(a) = \partial g (f(a)) \circ \partial f(a) \in \mathcal L(\mathbb R^k,\mathbb R^n)$.
The spaces of matrices $\mathcal M_{m \times k}(\mathbb R)$ and $\mathcal M_{n \times m}(\mathbb R)$ are isomorphic to $\mathcal L(\mathbb R^k,\mathbb R^m)$ and $\mathcal L(\mathbb R^m,\mathbb R^n)$ respectively. As such, there is a unique ${\bf A} \in \mathcal M_{m \times k}(\mathbb R)$ such that $\partial f(a) (v) = {\bf A}v$ for all $v \in \mathbb R^k$. Similarly, there is a unique ${\bf B} \in \mathcal M_{n \times m}(\mathbb R)$ such that $\partial g (f(a)) (v) = {\bf B}v$ for all $v \in \mathbb R^m$. It follows from Linear Algebra that $\partial g (f(a)) \circ \partial f(a)$ is associated with the matrix ${\bf BA} \in \mathcal M_{n \times k}(\mathbb R)$. Hence $(\partial g (f(a)) \circ \partial f(a)) (v)= {\bf BA}v$ for all $v \in \mathbb R^k$. Here ${\bf A}v$, ${\bf B}v$, and ${\bf BA}v$ are all matrix multiplications.
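As a small worked illustration of the matrices ${\bf A}$, ${\bf B}$ and ${\bf BA}$, here is one concrete case. The functions and the point are my own choice for illustration (they are not part of the theorem): take $k = m = 2$, $n = 1$, $f(x_1, x_2) = (x_1 x_2,\ x_1 + x_2)$, $g(y_1, y_2) = y_1 y_2$ and $a = (1, 2)$, so that $f(a) = (2, 3)$. Then

$$\begin{aligned}
{\bf A} &= [\partial f(a)] = \begin{pmatrix} x_2 & x_1 \\ 1 & 1 \end{pmatrix}\Bigg|_{(1,2)} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \\
{\bf B} &= [\partial g(f(a))] = \begin{pmatrix} y_2 & y_1 \end{pmatrix}\Big|_{(2,3)} = \begin{pmatrix} 3 & 2 \end{pmatrix}, \\
{\bf BA} &= \begin{pmatrix} 3 \cdot 2 + 2 \cdot 1 & 3 \cdot 1 + 2 \cdot 1 \end{pmatrix} = \begin{pmatrix} 8 & 5 \end{pmatrix}.
\end{aligned}$$

Differentiating the composite directly gives the same row: $(g \circ f)(x_1, x_2) = x_1^2 x_2 + x_1 x_2^2$, whose partial derivatives at $(1,2)$ are $2 x_1 x_2 + x_2^2 = 8$ and $x_1^2 + 2 x_1 x_2 = 5$.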
For a linear map $h$ between Euclidean vector spaces, we denote by $[h]$ its associated matrix. It follows that the chain rule can be rigorously written as $$\partial (g \circ f)(a) (v)= [ \partial g (f(a))] [\partial f(a)] v,\quad v \in \mathbb R^k$$
Because we can identify $\partial f(a)$ with $[\partial f(a)]$ and $\partial g(f(a))$ with $[\partial g(f(a))]$, the chain rule can consequently be written as $$\partial (g \circ f)(a) (v) = \partial g (f(a)) \partial f(a) v,\quad v \in \mathbb R^k$$ with the implicit understanding that $\partial g (f(a)) \partial f(a)$ is considered as a matrix, namely the product of the matrices $[\partial g (f(a))]$ and $[\partial f(a)]$.
Because we can identify $\partial (g \circ f)(a)$ with its associated matrix which, in the sense mentioned above, is $\partial g (f(a)) \partial f(a)$, the chain rule can be consequently written as $$\partial (g \circ f)(a) = \partial g (f(a)) \partial f(a)$$
As such, I can understand the chain rule's formula $\partial (g \circ f)(a) = \partial g (f(a)) \partial f(a)$ in two ways:
- It is a function composition of two maps $\partial g (f(a))$ and $\partial f(a)$, i.e.
$$\begin {array}{l|rcl} \partial (g \circ f)(a) & \mathbb R^k & \longrightarrow & \mathbb R^n \\ & v & \longmapsto & \partial g (f(a)) \circ \partial f(a) (v) \end{array}$$
- It is a matrix multiplication of $[\partial g (f(a))]$ and $[\partial f(a)]$, i.e.
$$\begin {array}{l|rcl} \partial (g \circ f)(a) & \mathbb R^k & \longrightarrow & \mathbb R^n \\ & v & \longmapsto & [\partial g (f(a))] [\partial f(a)] v \end{array}$$
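The two readings above can also be checked numerically on a concrete example. The following is only a minimal sketch in plain Python: the functions `f` and `g`, the point `a`, the vector `v`, and all helper names (`jac_f`, `jac_g`, `matmul`, `matvec`, `numerical_jacobian`) are hypothetical choices made for illustration, with $k=2$, $m=3$, $n=2$. It compares applying the two linear maps one after the other with multiplying by the single matrix ${\bf BA}$, and compares ${\bf BA}$ with a finite-difference approximation of the Jacobian of $g \circ f$.

```python
def f(x):
    # Illustrative f: R^2 -> R^3 (an assumption, not from the theorem)
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1 ** 2]

def g(y):
    # Illustrative g: R^3 -> R^2 (an assumption, not from the theorem)
    y1, y2, y3 = y
    return [y1 + y2 * y3, y1 * y3]

def jac_f(x):
    # Matrix [∂f(x)], computed by hand; shape 3 x 2
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0], [2 * x1, 0.0]]

def jac_g(y):
    # Matrix [∂g(y)], computed by hand; shape 2 x 3
    y1, y2, y3 = y
    return [[1.0, y3, y2], [y3, 0.0, y1]]

def matmul(B, A):
    # Ordinary matrix product B A
    return [[sum(B[i][l] * A[l][j] for l in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def matvec(M, v):
    # Matrix-vector product M v
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def numerical_jacobian(h, x, eps=1e-6):
    # Central differences: column j approximates the partial derivative in x_j
    rows = len(h(x))
    J = [[0.0] * len(x) for _ in range(rows)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        hp, hm = h(xp), h(xm)
        for i in range(rows):
            J[i][j] = (hp[i] - hm[i]) / (2 * eps)
    return J

a = [1.0, 2.0]
A = jac_f(a)        # [∂f(a)], 3 x 2
B = jac_g(f(a))     # [∂g(f(a))], 2 x 3
BA = matmul(B, A)   # matrix of the composition, 2 x 2

v = [0.5, -1.0]
composed = matvec(B, matvec(A, v))  # first reading: apply the maps in turn
product = matvec(BA, v)             # second reading: one matrix product

J_num = numerical_jacobian(lambda x: g(f(x)), a)

print(BA)        # [[9.0, 2.0], [6.0, 1.0]]
print(composed, product)
print(J_num)
```

Both readings give the same vector for this `v`, and the finite-difference Jacobian agrees with `BA` up to discretisation error, which is what the chain rule predicts for this particular example.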
Your write-up looks fine. Some broad commentary:
Assuming that $A,B,C$ are suitable spaces, with $f:A\to B$ and $g:B \to C$, then I will denote the derivative of $f$ at $a \in A$ by $Df(a)$. Note that $Df(a):A \to B$ is a linear operator, so given $\delta \in A$ we have $Df(a)(\delta) \in B$.
The chain rule states that if $h=g \circ f$, then $Dh(a)= Dg(f(a)) \circ Df(a)$, the composition of linear operators.
If $A=\mathbb{R}^k$, $B=\mathbb{R}^m$, then we typically identify a linear operator $L:A \to B$ with the matrix representation in terms of the standard bases $e_i$ and write the evaluation at $\delta \in A$ as $L \cdot \delta$ rather than $L(\delta)$. Similarly, if $C=\mathbb{R}^n$, and $M:B \to C$ is also linear, we write evaluation of $M \circ L$ at $\delta \in A$ as $M \cdot L \cdot \delta$ rather than $M(L(\delta))$.
However, one cannot always write the evaluation of a linear operator on a finite-dimensional space as the product of a single matrix and a point in the space. For example, if $A=\mathbb{R}^{2 \times 2}$ and $B=\mathbb{R}$ with $L(a) = [a]_{11}$, then $L(a)$ cannot be expressed in the form $D \cdot a$.
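One way to read this remark, sketched below in plain Python: the map $L(a) = [a]_{11}$ is linear, but a product $D \cdot a$ with the $2 \times 2$ matrix $a$ itself always has two columns, so it can never be a scalar. A matrix form only appears after choosing a basis of $\mathbb{R}^{2 \times 2}$, for instance by flattening $a$ into a vector of length 4. The row-major flattening `vec` and the names `row` and `L_via_vec` are assumptions made for this illustration.

```python
def L(a):
    # The linear functional from the example: pick the (1,1) entry
    return a[0][0]

def vec(a):
    # Flatten a 2 x 2 matrix row by row (an assumed choice of basis ordering)
    return [a[0][0], a[0][1], a[1][0], a[1][1]]

# With respect to that ordering, L is represented by a 1 x 4 row
row = [1.0, 0.0, 0.0, 0.0]

def L_via_vec(a):
    # Row times the flattened matrix, i.e. a 1 x 4 by 4 x 1 product
    return sum(r * x for r, x in zip(row, vec(a)))

def lin_comb(s, a, t, b):
    # Entrywise s*a + t*b for 2 x 2 matrices
    return [[s * a[i][j] + t * b[i][j] for j in range(2)] for i in range(2)]

a = [[3.0, 4.0], [5.0, 6.0]]
b = [[1.0, -2.0], [0.5, 7.0]]

# Linearity check on one pair of matrices and scalars
lhs = L(lin_comb(2.0, a, 3.0, b))
rhs = 2.0 * L(a) + 3.0 * L(b)

print(L(a), L_via_vec(a))
print(lhs, rhs)
```

So the derivative is always a linear operator, and writing it as a matrix acting on the point is a matter of first fixing coordinates on the domain.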