Chain rule and generalized composition of multilinear maps


I know that if functions $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^p$ are differentiable at $x \in \mathbb{R}^n$ and $f(x) \in \mathbb{R}^m$, respectively, with derivatives $Df(x) \in L(\mathbb{R}^n, \mathbb{R}^m)$ and $Dg(f(x)) \in L(\mathbb{R}^m, \mathbb{R}^p)$, then the composition $(g \circ f) : \mathbb{R}^n \to \mathbb{R}^p$, defined by $(g \circ f)(x) = g(f(x))$, is also differentiable at $x$, with derivative $$ D(g \circ f)(x) = Dg(f(x)) \circ Df(x).$$ In what follows, for visual simplicity, I use square brackets for the evaluation of a linear function.

Now, my problem is how to generalize this result to higher-order derivatives and multilinear maps. To see where the problem arises, let me consider the second derivative.

First, motivated by the chain rule, define the functions $Dg \circ f : \mathbb{R}^n \to L(\mathbb{R}^m, \mathbb{R}^p)$ by $(Dg \circ f)(x) = Dg(f(x))$ and $Df : \mathbb{R}^n \to L(\mathbb{R}^n, \mathbb{R}^m)$ by $x \mapsto Df(x)$. Then, for functions $A, B$ on $\mathbb{R}^n$ taking values in the appropriate $L$ spaces, define their pointwise composition by $(A \circ B)(x) = A(x) \circ B(x)$. With these definitions, the chain rule can be written for all $x \in \mathbb{R}^n$ as follows.

$$ D(g \circ f)(x) = ((Dg \circ f) \circ Df)(x) $$

Then, to motivate what the second derivative should be, consider the difference of $D(g \circ f)(x)$, where we would like to use that $Dg \circ f$ and $Df$ are also differentiable. This means that for $y$ close to $x$ we have: $$(Dg \circ f)(y) - (Dg \circ f)(x) \approx D(Dg \circ f)(x)[y-x]$$ $$Df(y) - Df(x) \approx D(Df)(x)[y-x] $$

Here, $D(Dg \circ f)(x) \in L(\mathbb{R}^n, L(\mathbb{R}^m, \mathbb{R}^p)) =: L^2(\mathbb{R}^n, \mathbb{R}^m, \mathbb{R}^p)$ and $D(Df)(x) \in L(\mathbb{R}^n, L(\mathbb{R}^n, \mathbb{R}^m)) =: L^2(\mathbb{R}^n, \mathbb{R}^n, \mathbb{R}^m)$.

So, intuition for the second derivative chain rule formula would be as follows.

$$ D(g \circ f)(y) - D(g \circ f)(x) = (Dg \circ f)(y) \circ Df(y) - (Dg \circ f)(x) \circ Df(x) = ((Dg \circ f)(y) - (Dg \circ f)(x)) \circ Df(y) + (Dg \circ f)(x) \circ (Df(y) - Df(x)) \approx D(Dg \circ f)(x)[y-x] \circ Df(x) + (Dg \circ f)(x) \circ D(Df)(x)[y-x]$$

On the other hand, we want the previous equation to be equal to $D(D(g \circ f))(x)[y-x]$. Therefore, I feel that one is motivated to define the following for appropriate multilinear maps $A,B$, which would be what I call "generalized composition":

$$ (A \circ_{L^2, L^1} B)[x] := A[x] \circ B,$$ $$ (A \circ_{L^1, L^2} B)[x] := A \circ B[x].$$

Then, the result would be written as follows. $$ D(D(g \circ f))(x) = D(Dg \circ f)(x) \circ_{L^2, L^1} Df(x) + (Dg \circ f)(x) \circ_{L^1, L^2} D(Df)(x)$$

However, if I now consider the third derivative of $g \circ f$, this approach seems to "fail" in the following way. If $A, B$ are both bilinear, so that each belongs to some $L^2$, my approach does not extend uniquely to a generalized composition: both of the following definitions are possible, but I believe they are not equivalent.

$$ (A \circ_{L^2, L^2} B)[x] = A[x] \circ_{L^1, L^2} B, $$ $$ (A \circ_{L^2, L^2} B)[x] = A \circ_{L^2, L^1} B[x].$$

Question: How can one define and motivate a unique, well-defined generalized composition of multilinear functions that appears naturally in higher-order chain rule results?


There are 1 best solutions below


I think one difficulty is that you are using the same notation for composition of linear maps and composition of smooth functions. For example, in the formula $$ D(g\circ f)(x) = Dg(f(x)) \circ Df(x) \tag 1 $$ it's not so clear how to just drop the $x$, since writing $(Dg\circ f) \circ Df$ kinda mixes up who gets composed with whom. As you noted, it gets worse when the linear maps have more than one entry. I think what you want can be achieved with abstract index notation, which helps separate the two. Instead of writing the evaluation of a linear map $T:V\to V$ on a vector $v$ as $T(v)$, write it as $T^b_av^a$, where the upper $b$ indicates that $T$ spits out a vector of $V$.

To handle vectors from different vector spaces, you can use different sets of letters. In your situation, you have $\def\R{\mathbb R}$ $$ \R^p \xleftarrow g \R^m \xleftarrow f \R^n $$ so you can write, say, lowercase indices ($a,b,c$...) for vectors from $\R^n$, uppercase indices ($A,B,C$...) for vectors from $\R^m$, and Greek indices ($\Psi,\Omega$...) for vectors in $\R^p$.

Then, for a linear map $T\in L(\R^n,\R^m)$ acting on a vector $v\in\R^n$, you can write $T(v)^A=T^A_av^a \in\R^m$. If you have another linear map $S\in L(\R^m,\R^p)$, then write $S(T(v))^\Psi=S^\Psi_AT^A_av^a$. You can also drop the vector and write $(S\circ T)^\Psi_a=S^\Psi_AT^A_a$, since $S\circ T$ eats something from $\R^n$ and spits out something from $\R^p$.
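To make the index bookkeeping concrete, here is a minimal Python sketch of the two contractions above. The helper names `apply` and `compose` and the sample matrices are my own illustrative choices, not part of any library; matrices are stored as lists of rows, so `T[A][a]` plays the role of $T^A_a$.

```python
# Hypothetical helpers: a linear map T in L(R^n, R^m) is stored as T[A][a].

def apply(T, v):
    """(T v)^A = T^A_a v^a: contract the lower index of T with v."""
    return [sum(row[a] * v[a] for a in range(len(v))) for row in T]

def compose(S, T):
    """(S∘T)^Psi_a = S^Psi_A T^A_a: contract the shared middle index A."""
    m, n = len(T), len(T[0])
    return [[sum(S_row[A] * T[A][a] for A in range(m)) for a in range(n)]
            for S_row in S]

T = [[1, 2], [0, 1], [3, 0]]   # T in L(R^2, R^3), so n = 2 and m = 3
S = [[1, 0, 2]]                # S in L(R^3, R^1), so p = 1
v = [1, -1]                    # a vector in R^2

lhs = apply(S, apply(T, v))    # S(T(v)): feed v to T, then the result to S
rhs = apply(compose(S, T), v)  # (S∘T)(v): contract A first, then feed v
print(lhs, rhs)
```

Both routes contract the same indices, only in a different order, so they agree.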

With this notation, formula $(1)$ can be written as $$ D(g\circ f)^\Psi_a = (Dg\circ f)^\Psi_A Df^A_a $$ where I wrote $Df^A_a$ instead of $(Df)^A_a$ just for ease of writing. The advantage of this is that now all the $\circ$s are for composition of smooth functions and the composition of linear maps is taken care of by the indices. One can moreover move the indices to the $D$s, which will help us distinguish who eats whom when we take higher order derivatives. $$ D_a(g\circ f)^\Psi = (D_Ag\circ f)^\Psi D_af^A $$ Much better.
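As a quick sanity check of the indexed first-order formula, here is a small Python sketch. The maps $f(x,y) = (xy,\ x + y^3)$ and $g(s,t) = s^2 t$ are hypothetical test functions I picked for illustration (they are not from the question), and the derivative tables are computed by hand. Since $p = 1$ here, the index $\Psi$ is dropped.

```python
# Hypothetical test maps: f(x, y) = (x*y, x + y**3), g(s, t) = s**2 * t.

def f(x, y):
    return (x * y, x + y**3)

def Df(x, y):
    """Df[A][a] = D_a f^A, the Jacobian of f."""
    return [[y, x], [1, 3 * y**2]]

def Dg(s, t):
    """Dg[A] = D_A g (a single row, because p = 1)."""
    return [2 * s * t, s**2]

def D_composite(x, y):
    """D_a(g∘f) = (D_A g ∘ f) D_a f^A: Dg is evaluated at f(x, y)."""
    dg = Dg(*f(x, y))
    df = Df(x, y)
    return [sum(dg[A] * df[A][a] for A in range(2)) for a in range(2)]

# Compare with the gradient of h = g∘f = x**3*y**2 + x**2*y**5 done by hand.
x, y = 1, 2
grad = D_composite(x, y)
expected = [3 * x**2 * y**2 + 2 * x * y**5, 2 * x**3 * y + 5 * x**2 * y**4]
print(grad, expected)
```

Note that the only place the smooth composition $\circ$ shows up in the code is the call `Dg(*f(x, y))`; everything else is index contraction.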

Examples

Taking a second derivative, we introduce another lowercase index $b$, which means the new object will now eat two vectors from $\R^n$:

$$ \begin{align*} D_bD_a(g\circ f)^\Psi &= D_b((D_Ag\circ f)^\Psi D_af^A) \\ &= D_b(D_Ag\circ f)^\Psi D_af^A + (D_Ag\circ f)^\Psi D_bD_af^A \\ &= (D_BD_Ag\circ f)^\Psi D_bf^BD_af^A + (D_Ag\circ f)^\Psi D_bD_af^A \end{align*} $$

Or a third derivative...

$$ \begin{align*} D_cD_bD_a(g\circ f)^\Psi &= D_c((D_BD_Ag\circ f)^\Psi D_bf^BD_af^A + (D_Ag\circ f)^\Psi D_bD_af^A) \\ &= D_c((D_BD_Ag\circ f)^\Psi D_bf^BD_af^A) + D_c((D_Ag\circ f)^\Psi D_bD_af^A) \\ &= D_c(D_BD_Ag\circ f)^\Psi D_bf^B D_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cD_bf^B D_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_bf^B D_cD_af^A \\ &\hspace{10mm} + D_c(D_Ag\circ f)^\Psi D_bD_af^A \\ &\hspace{10mm} + (D_Ag\circ f)^\Psi D_cD_bD_af^A \\ &= (D_CD_BD_Ag\circ f)^\Psi D_cf^C D_bf^B D_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cD_bf^B D_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_bf^B D_cD_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cf^B D_bD_af^A \\ &\hspace{10mm} + (D_Ag\circ f)^\Psi D_cD_bD_af^A \end{align*} $$

So we got ourselves three scary-looking formulas full of indices

$$ \begin{align} D_a(g\circ f)^\Psi &= (D_Ag\circ f)^\Psi D_af^A \\ D_bD_a(g\circ f)^\Psi &= (D_BD_Ag\circ f)^\Psi D_bf^BD_af^A + (D_Ag\circ f)^\Psi D_bD_af^A \\ D_cD_bD_a(g\circ f)^\Psi &= (D_CD_BD_Ag\circ f)^\Psi D_cf^C D_bf^B D_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cD_bf^B D_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_bf^B D_cD_af^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cf^B D_bD_af^A \\ &\hspace{10mm} + (D_Ag\circ f)^\Psi D_cD_bD_af^A \tag 2 \end{align} $$

but if we were to put in three vectors $u^a,v^b,w^c$ we would know exactly to whom we should feed each of them! For example, for the third derivative you have

$$ \begin{align} D^3(g\circ f)[w,v,u]^\Psi &= D_cD_bD_a(g\circ f)^\Psi u^a v^b w^c \\ &= (D_CD_BD_Ag\circ f)^\Psi D_cf^C D_bf^B D_af^A u^a v^b w^c \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cD_bf^B D_af^A u^a v^b w^c \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_bf^B D_cD_af^A u^a v^b w^c \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D_cf^B D_bD_af^A u^a v^b w^c \\ &\hspace{10mm} + (D_Ag\circ f)^\Psi D_cD_bD_af^A u^a v^b w^c \\ &= (D_CD_BD_Ag\circ f)^\Psi Df[w]^C Df[v]^B Df[u]^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi D^2f[w,v]^B Df[u]^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi Df[v]^B D^2f[w,u]^A \\ &\hspace{10mm} + (D_BD_Ag\circ f)^\Psi Df[w]^B D^2f[v,u]^A \\ &\hspace{10mm} + (D_Ag\circ f)^\Psi D^3f[w,v,u]^A \\ &= (D^3g\circ f)[Df[w],Df[v],Df[u]]^\Psi \\ &\hspace{10mm} + (D^2g\circ f)[D^2f[w,v],Df[u]]^\Psi \\ &\hspace{10mm} + (D^2g\circ f)[Df[v],D^2f[w,u]]^\Psi \\ &\hspace{10mm} + (D^2g\circ f)[Df[w],D^2f[v,u]]^\Psi \\ &\hspace{10mm} + (Dg\circ f)[D^3f[w,v,u]]^\Psi \end{align} $$

where I now use square brackets for linear evaluation, without point evaluation. Hence,

$$ \begin{align} D^3(g\circ f)[w,v,u] &= (D^3g\circ f)[Df[w],Df[v],Df[u]] \\ &\hspace{10mm} + (D^2g\circ f)[D^2f[w,v],Df[u]] \\ &\hspace{10mm} + (D^2g\circ f)[Df[v],D^2f[w,u]] \\ &\hspace{10mm} + (D^2g\circ f)[Df[w],D^2f[v,u]] \\ &\hspace{10mm} + (Dg\circ f)[D^3f[w,v,u]] \end{align} $$

You still need to evaluate the whole expression at a point $x\in\R^n$; doing so, we get

$$ \begin{align} D^3(g\circ f)(x)[w,v,u] &= D^3g(f(x))[Df(x)[w],Df(x)[v],Df(x)[u]] \\ &\hspace{10mm} + D^2g(f(x))[D^2f(x)[w,v],Df(x)[u]] \\ &\hspace{10mm} + D^2g(f(x))[Df(x)[v],D^2f(x)[w,u]] \\ &\hspace{10mm} + D^2g(f(x))[Df(x)[w],D^2f(x)[v,u]] \\ &\hspace{10mm} + Dg(f(x))[D^3f(x)[w,v,u]] \end{align} $$
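If you want to convince yourself that the index gymnastics is right, here is a rough Python sketch that evaluates the second- and third-order formulas by brute-force contraction and compares them with partial derivatives computed by hand. The test maps $f(x,y) = (xy,\ x + y^3)$ and $g(s,t) = s^2 t$, the evaluation point, and the names `second` and `third` are all assumptions made for illustration; since $p = 1$, the index $\Psi$ is dropped.

```python
from itertools import product

# Hypothetical test maps (not from the question):
#   f(x, y) = (x*y, x + y**3),  g(s, t) = s**2 * t,
# so h = g∘f = x**3*y**2 + x**2*y**5.
x, y = 1, 2
s, t = x * y, x + y**3                      # the point f(x, y)

# Derivatives of f at (x, y); the first index is the upper index A.
Df = [[y, x], [1, 3 * y**2]]                             # Df[A][a]
D2f = [[[0, 1], [1, 0]], [[0, 0], [0, 6 * y]]]           # D2f[A][b][a]
D3f = [[[[0, 0], [0, 0]], [[0, 0], [0, 0]]],
       [[[0, 0], [0, 0]], [[0, 0], [0, 6]]]]             # D3f[A][c][b][a]

# Derivatives of g at f(x, y).
Dg = [2 * s * t, s**2]                                   # Dg[A]
D2g = [[2 * t, 2 * s], [2 * s, 0]]                       # D2g[B][A]
D3g = [[[0, 2], [2, 0]], [[2, 0], [0, 0]]]               # only g_sst = 2 survives

R = range(2)

def second(b, a):
    """D_b D_a (g∘f): the two-term formula."""
    quad = sum(D2g[B][A] * Df[B][b] * Df[A][a] for B, A in product(R, R))
    return quad + sum(Dg[A] * D2f[A][b][a] for A in R)

def third(c, b, a):
    """D_c D_b D_a (g∘f): the five-term formula (2)."""
    total = sum(D3g[C][B][A] * Df[C][c] * Df[B][b] * Df[A][a]
                for C, B, A in product(R, R, R))
    for B, A in product(R, R):
        total += D2g[B][A] * (D2f[B][c][b] * Df[A][a]
                              + Df[B][b] * D2f[A][c][a]
                              + Df[B][c] * D2f[A][b][a])
    return total + sum(Dg[A] * D3f[A][c][b][a] for A in R)

# Hand-computed partials of h, keyed by how many y-derivatives are taken
# (index 0 means x, index 1 means y, so the key is the sum of the indices).
h2 = {0: 6*x*y**2 + 2*y**5, 1: 6*x**2*y + 10*x*y**4, 2: 2*x**3 + 20*x**2*y**3}
h3 = {0: 6*y**2, 1: 12*x*y + 10*y**4, 2: 6*x**2 + 40*x*y**3, 3: 60*x**2*y**2}

ok2 = all(second(b, a) == h2[b + a] for b, a in product(R, R))
ok3 = all(third(c, b, a) == h3[c + b + a] for c, b, a in product(R, R, R))
print(ok2, ok3)
```

Everything is integer arithmetic, so the comparison is exact rather than up to rounding. This checks the formulas at one point for one pair of maps only; it is a sanity check, not a proof.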

Final commentary

Actually I'm more fond of the indexed formulas, as in $(2)$. Let me say why. When evaluating without indices, I implicitly adopted the convention that the first slot of the derivative corresponds to the last derivative taken: $$ D^2f[u,v]^B = D_aD_bf^B u^a v^b $$ but I could have chosen the opposite convention: $$ D^2f[u,v]^B = D_bD_af^B u^a v^b. $$ In flat space this doesn't matter, because the Hessian is symmetric: second derivatives commute, $D_aD_b=D_bD_a$, at least if the function is twice continuously differentiable. But it becomes a problem when doing calculus in curved spaces, where curvature is measured precisely by the failure of second covariant derivatives to commute. I have seen this little ambiguity elsewhere, and I think index notation helps resolve it. Oh, well, I think that's all I had to say for now. Hope it helps.