Chain rule for matrix-vector composition


Suppose $A(t,x)$ is an $n\times n$ matrix that depends on a parameter $t$ and a variable $x$, and let $f(t,x)$ be such that $f(t,\cdot)\colon \mathbb{R}^n \to \mathbb{R}^n$.

Is there a chain rule for $$\frac{d}{dt} A(t,f(t,x))?$$

It should be something like $A_t(t,f(t,x)) + ....$, what is the other term?


BEST ANSWER

Yes, there is a chain rule for such functions. Before getting to that, let's just briefly discuss partial derivatives for multivariable functions.

Let $V_1, \dots, V_n, W$ be Banach spaces (think finite-dimensional if you wish, such as $\Bbb{R}^{k_1}, \dots, \Bbb{R}^{k_n}, \Bbb{R}^m$, or spaces of matrices like $M_{m \times n}(\Bbb{R})$). Let $\phi: V_1 \times \dots \times V_n \to W$ be a map, and $a = (a_1, \dots, a_n) \in V_1 \times \dots \times V_n$. We say the function $\phi$ has an $i^{th}$ partial derivative at the point $a$ if the function $\phi_i:V_i \to W$ defined by \begin{align} \phi_i(x) := \phi(a_1, \dots, a_{i-1}, x, a_{i+1}, \dots, a_n) \end{align} is differentiable at the point $a_i$. In this case, we define the $i^{th}$ partial derivative of $\phi$ at $a$ to be the derivative of $\phi_i$ at the point $a_i$. In symbols, we write: \begin{align} (\partial_i\phi)_a := D(\phi_i)_{a_i} \in \mathcal{L}(V_i,W). \end{align} We may also use notation like $\dfrac{\partial \phi}{\partial x^i}(a)$ or anything else which resembles the classical notation. The important thing to keep in mind is that $(\partial_i \phi)(a) \equiv \dfrac{\partial \phi}{\partial x^i}(a)$ is by definition a linear map $V_i \to W$.

Note that this is almost word for word the same definition you might have seen before (or at least, if you think about it for a while, you can convince yourself it's very similar). The idea is of course that we fix all but the $i^{th}$ variable, and then consider the derivative of the resulting function at the point $a_i$. Next, we need one last bit of background.

One very important special case which deserves mention is when the domain of the function is $\Bbb{R}$. So suppose we have a function $\psi: \Bbb{R} \to W$. Then we have two notions of differentiation. The first is the familiar one, as the limit of difference quotients: \begin{align} \dfrac{d\psi}{dt}\bigg|_t \equiv \dot{\psi}(t) \equiv \psi'(t) := \lim_{h \to 0} \dfrac{\psi(t+h) - \psi(t)}{h} \end{align} (the limit on the RHS being taken with respect to the norm on $W$). The second notion arises because $\Bbb{R}$ and $W$ are both normed vector spaces, so $\psi: \Bbb{R} \to W$ is a map between normed vector spaces, and we can consider the derivative $D \psi_t: \Bbb{R} \to W$ as a linear map.

How are these two notions related? Very simply. Note that $\mathcal{L}(\Bbb{R},W)$ is naturally isomorphic to $W$ (because $\Bbb{R}$ is one-dimensional), the isomorphism being $T \mapsto T(1)$. So, there is a theorem which says that $\psi'(t)$ (the limit of difference quotients) exists if and only if $D\psi_t$ (a linear map $\Bbb{R} \to W$) exists, and in this case, \begin{align} \psi'(t) = D\psi_t(1). \end{align} Henceforth, whenever I use $\dfrac{d}{dt}$ or $\dfrac{\partial}{\partial t}$ notation, where the $t$ refers to the fact that the domain is $\Bbb{R}$, I shall always mean the vector in $W$ obtained as the limit of the difference quotient (which you now know is simply the evaluation of the linear map on the element $1 \in \Bbb{R}$).
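As a quick numerical sketch of this equivalence (the rotation-matrix curve below is my own illustrative choice, not from the question), the difference quotient of a matrix-valued curve $\psi: \Bbb{R} \to M_{2\times 2}(\Bbb{R})$ recovers its entrywise analytic derivative:

```python
import numpy as np

def psi(t):
    # Example curve psi: R -> M_{2x2}(R), a rotation matrix (illustrative choice)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def psi_prime_analytic(t):
    # Entrywise analytic derivative of psi
    return np.array([[-np.sin(t), -np.cos(t)],
                     [ np.cos(t), -np.sin(t)]])

def difference_quotient(phi, t, h=1e-6):
    # (phi(t+h) - phi(t)) / h, approximating the limit defining psi'(t) in W
    return (phi(t + h) - phi(t)) / h

t0 = 0.7
approx = difference_quotient(psi, t0)
exact = psi_prime_analytic(t0)
print(np.max(np.abs(approx - exact)))  # small, O(h)
```

Here the norm on $W = M_{2\times 2}(\Bbb{R})$ is implicitly the max-entry norm; any norm works since $W$ is finite-dimensional.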

See Loomis and Sternberg's Advanced Calculus, Section $3.6-3.8$ ($3.7, 3.8$ mainly) for more information.


Anyway, the chain rule in this case is as follows: \begin{align} \dfrac{d}{dt}A(t, f(t,x)) &= \dfrac{\partial A}{\partial t}\bigg|_{(t,f(t,x))} + \dfrac{\partial A}{\partial x}\bigg|_{(t,f(t,x))} \left[ \dfrac{\partial f}{\partial t}\bigg|_{(t,x)}\right] \tag{$*$} \end{align} What does this mean? Well, on the LHS, we have a function $\psi: \Bbb{R} \to M_{n \times n}(\Bbb{R})$, defined by \begin{align} \psi(t):= A(t,f(t,x)), \end{align} and we are trying to compute $\psi'(t)$. On the RHS, note that $A: \Bbb{R} \times \Bbb{R}^n \to M_{n \times n}(\Bbb{R})$. So, the first term is $\dfrac{\partial A}{\partial t}\bigg|_{(t,f(t,x))} \in M_{n \times n}(\Bbb{R})$, which is exactly what you predicted.

Now, how do we understand the second term? Again, note that $A$ maps $\Bbb{R} \times \Bbb{R}^n \to M_{n\times n}(\Bbb{R})$. So, $\dfrac{\partial A}{\partial x}\bigg|_{(t,f(t,x))}$ is the partial derivative of $A$ with respect to the variables in $\Bbb{R}^n$ (i.e. we are considering $V_1 = \Bbb{R}$ and $V_2 = \Bbb{R}^n$, so it is the $2$nd partial derivative of $A$), calculated at the point $(t,f(t,x)) \in \Bbb{R} \times \Bbb{R}^n$ of its domain. Note that this is by definition a linear map $\Bbb{R}^n \to M_{n \times n}(\Bbb{R})$. We then evaluate this linear transformation on the vector $\dfrac{\partial f}{\partial t}\bigg|_{(t,x)} \in \Bbb{R}^n$ to finally end up with the matrix $\dfrac{\partial A}{\partial x}\bigg|_{(t,f(t,x))} \left[ \dfrac{\partial f}{\partial t}\bigg|_{(t,x)}\right] \in M_{n \times n}(\Bbb{R})$. This is how to read the notation in $(*)$.
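To make $(*)$ concrete, here is a numerical sketch with an illustrative $A$ and $f$ of my own choosing (not from the question). The left side is a difference quotient of $t \mapsto A(t, f(t,x))$; the second term on the right is the linear map $\partial A/\partial x$ evaluated on the vector $f_t$, approximated as a directional difference quotient:

```python
import numpy as np

def A(t, x):
    # Illustrative A: R x R^2 -> M_{2x2}(R) (my own choice, not from the question)
    return np.array([[t * x[0],    np.sin(x[1])],
                     [x[0] * x[1], t + x[0]**2]])

def f(t, x):
    # Illustrative f(t, .): R^2 -> R^2
    return np.array([x[0] + t**2, t * x[1]])

t0, x0, h = 0.5, np.array([1.0, 2.0]), 1e-6

# LHS of (*): derivative of t -> A(t, f(t, x0)) as a difference quotient
lhs = (A(t0 + h, f(t0 + h, x0)) - A(t0, f(t0, x0))) / h

y0 = f(t0, x0)
# First term: partial derivative of A in the R slot, at (t0, f(t0, x0))
A_t = (A(t0 + h, y0) - A(t0, y0)) / h
# The vector f_t(t0, x0) in R^2
f_t = (f(t0 + h, x0) - f(t0, x0)) / h
# Second term: the linear map (dA/dx)|_{(t0, y0)} evaluated on f_t, i.e. the
# directional derivative of A(t0, .) at y0 in the direction f_t
A_x_ft = (A(t0, y0 + h * f_t) - A(t0, y0)) / h

rhs = A_t + A_x_ft
print(np.max(np.abs(lhs - rhs)))  # small, O(h)
```

Note how the second term never needs a matrix representation of $\partial A/\partial x$: the linear map is only ever applied to the single vector $f_t$.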


If for some reason you don't like to think in terms of linear transformations, here's an alternative approach, in a simplified case, using Jacobian matrices (though I don't like such a presentation). Suppose that $A$ is a function $A : \Bbb{R} \times \Bbb{R}^n \to \Bbb{R}^m$, and $f: \Bbb{R} \times \Bbb{R}^n \to \Bbb{R}^n$. Then, we can say \begin{align} \dfrac{d}{dt} A(t, f(t,x)) &= (\text{Jac}_{\Bbb{R}} A)(t,f(t,x)) + (\text{Jac}_{\Bbb{R}^n}A)(t, f(t,x)) \cdot \dfrac{\partial f}{\partial t}\bigg|_{(t,x)}\\ &=\dfrac{\partial A}{\partial t}\bigg|_{(t,f(t,x))} + (\text{Jac}_{\Bbb{R}^n}A)(t, f(t,x)) \cdot \dfrac{\partial f}{\partial t}\bigg|_{(t,x)}. \end{align} Note that the Jacobian matrix of $A: \Bbb{R}\times \Bbb{R}^n \to \Bbb{R}^m$ evaluated at the point $(t,f(t,x)) \in \Bbb{R}\times \Bbb{R}^n$, denoted by $(\text{Jac }A)(t, f(t,x))$, is an $m \times (1 +n)$ matrix. So, when I say $(\text{Jac}_{\Bbb{R}}A)(t, f(t,x))$, I mean the $m \times 1$ submatrix obtained by taking the first column (so that we only keep track of the derivative with respect to the $\Bbb{R}$ variable, i.e. with respect to $t$). This is just a vector in $\Bbb{R}^m$. Next, when I say $(\text{Jac}_{\Bbb{R}^n}A)(t, f(t,x))$, I mean the $m \times n$ submatrix obtained by ignoring the first column (so that we only keep track of the derivatives with respect to the $\Bbb{R}^n$ variables). Then, we multiply this $m \times n$ matrix by the $n \times 1$ vector $\dfrac{\partial f}{\partial t}\bigg|_{(t,x)}$ to finally get an $m \times 1$ matrix, or simply a vector in $\Bbb{R}^m$.
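The column-slicing just described can be sketched numerically; everything below (the vectorized $A$, the choices $n=2$, $m=4$) is an illustrative assumption, not from the question:

```python
import numpy as np

def A_vec(t, x):
    # A: R x R^2 -> M_{2x2}(R), "vectorized" (flattened) into R^4 (illustrative)
    return np.array([t * x[0], np.sin(x[1]), x[0] * x[1], t + x[0]**2])

def f(t, x):
    return np.array([x[0] + t**2, t * x[1]])

t0, x0, h = 0.5, np.array([1.0, 2.0]), 1e-6
y0 = f(t0, x0)

# Build the full m x (1+n) Jacobian of A at (t0, y0), column by column
cols = [(A_vec(t0 + h, y0) - A_vec(t0, y0)) / h]  # d/dt column
for k in range(2):                                # d/dx_k columns
    e = np.zeros(2); e[k] = h
    cols.append((A_vec(t0, y0 + e) - A_vec(t0, y0)) / h)
jac = np.column_stack(cols)                       # shape (4, 3)

f_t = (f(t0 + h, x0) - f(t0, x0)) / h

# First column = Jac_R A; remaining m x n block = Jac_{R^n} A
rhs = jac[:, 0] + jac[:, 1:] @ f_t

lhs = (A_vec(t0 + h, f(t0 + h, x0)) - A_vec(t0, f(t0, x0))) / h
print(np.max(np.abs(lhs - rhs)))  # small, O(h)
```

The flattening into $\Bbb{R}^4$ is exactly the basis choice the answer warns about: it works, but the matrix structure of the target has to be reassembled afterwards.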

The reason I don't like this approach is that in your case the target space is $M_{n \times n}(\Bbb{R})$, so it is not natural to think of it as $\Bbb{R}^m$. Sure, you could construct an isomorphism to $\Bbb{R}^{n^2}$, but this requires a choice of basis in order to "vectorize" a matrix. And in the end you will probably want to "undo" the vectorization, at which point the whole thing is just a mess. Doable, but very ad hoc; it is much cleaner to treat everything as linear transformations, because then it doesn't matter what the domain or target spaces are... it's pretty much linear algebra from there.

To hopefully convince you further of the generality (and simplicity) of the linear transformations approach, let $V,W$ be normed vector spaces, let $A: \Bbb{R} \times V \to W$ be a differentiable map, and let $f: \Bbb{R} \times V \to V$ be differentiable. Then, \begin{align} \dfrac{d}{dt} \bigg|_t A(t,f(t,x)) &= \dfrac{\partial A}{\partial t} \bigg|_{(t,f(t,x))} + \dfrac{\partial A}{\partial x} \bigg|_{(t,f(t,x))}\left[ \dfrac{\partial f}{\partial t}\bigg|_{(t,x)}\right] \in W, \end{align} i.e. the formula for the chain rule stays exactly the same, regardless of what the vector spaces $V,W$ are. But if you insist on thinking of everything in terms of Jacobian matrices, you will have a tough time first constructing isomorphisms $V \cong \Bbb{R}^n$ and $W \cong \Bbb{R}^m$, then doing everything in the Cartesian spaces, and then "undoing" the isomorphisms to re-express everything back in terms of the spaces $V$ and $W$.


Or of course, another way to think of it is to express everything in terms of component functions of the matrix-valued function $A$: \begin{align} \dfrac{d}{dt}A_{ij}(t,f(t,x)) &= \dfrac{\partial A_{ij}}{\partial t}\bigg|_{(t,f(t,x))} + \sum_{k=1}^n\dfrac{\partial A_{ij}}{\partial x_k}\bigg|_{(t,f(t,x))} \cdot \dfrac{\partial f_k}{\partial t}\bigg|_{(t,x)} \end{align} (all these partial derivatives being real numbers). But of course, for obvious reasons, this component-by-component approach can get very tedious very quickly (and doesn't generalize well), and also it didn't seem to be what you really wanted to ask about, which is why I'm mentioning it at the end.
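As a quick sanity check of this component formula (with an illustrative choice of entry and of $f$, not from the question), take $n = 2$, $A_{11}(t,x) = t x_1$ and $f(t,x) = (x_1 + t^2,\, t x_2)$. Then \begin{align} A_{11}(t, f(t,x)) = t(x_1 + t^2) \implies \dfrac{d}{dt}A_{11}(t,f(t,x)) = x_1 + 3t^2, \end{align} while the right-hand side gives \begin{align} \dfrac{\partial A_{11}}{\partial t}\bigg|_{(t,f(t,x))} + \sum_{k=1}^2\dfrac{\partial A_{11}}{\partial x_k}\bigg|_{(t,f(t,x))} \cdot \dfrac{\partial f_k}{\partial t}\bigg|_{(t,x)} = f_1(t,x) + t \cdot 2t + 0 \cdot x_2 = x_1 + t^2 + 2t^2 = x_1 + 3t^2, \end{align} in agreement.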

ANSWER

Let $y=(y_1, \ldots, y_n) = (f_1(t, x), \ldots, f_n(t, x))$. Considering the $a_{ij}$ entry, we see that \begin{align} \frac{d}{dt}a_{ij}(t, f(t,x)) = \partial_ta_{ij}(t, f(t, x))+\sum^n_{k=1} \underbrace{a_{ij, k}(t, f(t, x))}_{\partial_k a_{ij}(t, f(t, x))}\partial_t f_k(t, x). \end{align} Hence it follows that \begin{align} \frac{d}{dt}A(t, f(t, x))= \frac{\partial A}{\partial t}+\nabla A\cdot \partial_t f. \end{align} Here $\nabla A$ is the gradient of a rank-$2$ tensor, which is a rank-$3$ tensor (I am assuming everything is in Euclidean space). In general, $\nabla A$ is called the covariant derivative of $A$. Also, note that $\nabla A\cdot \partial_t f$ is the directional derivative of $A$ in the direction $\partial_t f$.

ANSWER

Consider a scalar function $\alpha$ which matches the proposed functional form, i.e. $$\eqalign{ &\alpha = \alpha(t,f) \qquad &f = f(t,x) \\ &\alpha,t\in{\mathbb R} \qquad &f,x\in{\mathbb R}^{n} }$$ Everyone knows how to calculate its total time derivative: $$\eqalign{ \frac{d\alpha}{dt} &= \frac{\partial\alpha}{\partial t} + \left(\frac{\partial\alpha}{\partial f}\cdot\frac{\partial f}{\partial t}\right) + \left(\frac{\partial\alpha}{\partial f}\cdot\frac{\partial f}{\partial x}\cdot\frac{\partial x}{\partial t}\right) }$$ (the last term vanishes when $x$ does not itself depend on $t$). The only twist with the matrix $A={\bf[}\alpha_{ij}{\bf]}\;$ is that each element is such a function.
Therefore $$\eqalign{ \frac{dA}{dt} = \left[\frac{d\alpha_{ij}}{dt}\right] }$$