Multivariable chain rule with vector valued function


Suppose $f:\mathbb{R}^n \rightarrow \mathbb{R}$, $\mathbf{g}:\mathbb{R}^n \rightarrow \mathbb{R}^n$ and $\mathbf{x} \in \mathbb{R}^n$. How do I find a formula for $\nabla (f \circ \mathbf{g})(\mathbf{x})$?

There are 4 solutions below.

Answer 1 (score 0)

We have \begin{align*} &f:\mathbb{R}^n \rightarrow \mathbb{R}\\ &\mathbf{y}\to f(\mathbf{y})=f(y_1,\ldots,y_n)\\ \\ &\mathbf{g}:\mathbb{R}^n \rightarrow \mathbb{R^n}\\ &\mathbf{x}\to \mathbf{g}(\mathbf{x})=\mathbf{g}(x_1,\ldots,x_n)\\ &\qquad\qquad\,=(g_1(x_1,\ldots,x_n),\ldots,g_n(x_1,\ldots,x_n))\\ \end{align*}

Recalling the nabla-operator applied to $f$ \begin{align*} \nabla f(\mathbf{y})=\left(\frac{\partial f}{\partial y_1},\ldots,\frac{\partial f}{\partial y_n}\right) \end{align*}

we obtain \begin{align*} \color{blue}{\nabla f(\mathbf{g}(\mathbf{x}))} &=\left(\frac{\partial }{\partial x_j}f(\mathbf {g(\mathbf{x})})\right)_{1\leq j\leq n}\\ &=\left(\frac{\partial }{\partial x_j} f(\mathbf {g}(x_1,\ldots,x_n))\right)_{1\leq j\leq n}\\ &=\left(\frac{\partial }{\partial x_j}f(g_1(x_1,\ldots,x_n),\ldots,g_n(x_1,\ldots,x_n))\right)_{1\leq j\leq n}\\ &=\left(\frac{\partial f}{\partial g_1}\cdot\frac{\partial g_1}{\partial x_j}+\cdots+ \frac{\partial f}{\partial g_n}\cdot\frac{\partial g_n}{\partial x_j}\right)_{1\leq j\leq n}\\ &\,\,\color{blue}{=\left(\sum_{k=1}^n\frac{\partial f}{\partial g_k}\cdot\frac{\partial g_k}{\partial x_j}\right)_{1\leq j\leq n}} \end{align*}
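As a quick numerical sanity check of the componentwise formula above (not part of the original answer), here is a minimal Python sketch with a hypothetical choice of $f$ and $\mathbf{g}$ for $n=2$: it computes $\sum_k \frac{\partial f}{\partial g_k}\frac{\partial g_k}{\partial x_j}$ from hand-computed partials and compares it against a finite-difference derivative of $f\circ\mathbf{g}$.

```python
from math import sin, cos, exp

# Hypothetical example (n = 2, chosen for illustration):
# f(y1, y2) = y1 * y2,  g(x1, x2) = (sin(x1), exp(x2)).
def f(y):
    return y[0] * y[1]

def g(x):
    return (sin(x[0]), exp(x[1]))

# Hand-computed pieces: grad f = (y2, y1); Jacobian of g is diagonal here.
def grad_f(y):
    return (y[1], y[0])

def jac_g(x):
    # row k = gradient of g_k
    return ((cos(x[0]), 0.0), (0.0, exp(x[1])))

x = (0.7, -0.3)
y = g(x)
J = jac_g(x)

# Chain rule, j-th entry: sum over k of (df/dg_k) * (dg_k/dx_j)
analytic = tuple(sum(grad_f(y)[k] * J[k][j] for k in range(2)) for j in range(2))

# Finite-difference approximation of d(f∘g)/dx_j
h = 1e-6
def fg(x):
    return f(g(x))
numeric = tuple(
    (fg((x[0] + h * (j == 0), x[1] + h * (j == 1))) - fg(x)) / h for j in range(2)
)

for a, b in zip(analytic, numeric):
    assert abs(a - b) < 1e-4
```

Here $f\circ\mathbf{g}$ is $\sin(x_1)e^{x_2}$, so both entries can also be checked by hand: $\cos(x_1)e^{x_2}$ and $\sin(x_1)e^{x_2}$.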

Answer 2 (score 3)

Personally I find it easier to work in more generality (where we don't have this strange asymmetry between the scalar-valuedness of $f$ and the vector-valuedness of $g$, and so the situation is more analogous to the one-dimensional chain rule).

First the general definition of the derivative of a function between higher-dimensional spaces:

We say that a function $f \colon \mathbb{R}^m \to \mathbb{R}^n$ is differentiable at a point $\mathbf{x} \in \mathbb{R}^m$ if there exists a matrix $Df_\mathbf{x} \in \mathbb{R}^{n \times m}$ (called the derivative of $f$ at $\mathbf{x}$) such that $$ \frac{f(\mathbf{x}+\mathbf{h})-f(\mathbf{x})-Df_\mathbf{x}\mathbf{h}}{|\mathbf{h}|} \to \mathbf{0} \hspace{5mm} \textrm{as } \mathbf{h} \to \mathbf{0}. $$ In this case, if we write $\,f_i \colon \mathbb{R}^m \to \mathbb{R}$ (with $i \in \{1,\ldots,n\}$) for the $i$-th coordinate function of $f$, then for each $j \in \{1,\ldots,m\}$, the partial derivative $\,\partial_jf_i\,$ of $\,f_i\,$ in its $j$-th input exists at $\mathbf{x}$, and $$ \boxed{(i,j)\textrm{-entry of } Df_\mathbf{x} \ = \ \partial_jf_i(\mathbf{x}).} $$ In the case that $f$ is scalar-valued, the derivative $Df_\mathbf{x}$ is simply a row vector, namely the transpose of $\nabla f$.

Theorem. Suppose we have $g \colon \mathbb{R}^l \to \mathbb{R}^m$ and $f \colon \mathbb{R}^m \to \mathbb{R}^n$ such that $g$ is differentiable at a point $\mathbf{x} \in \mathbb{R}^l$ and $f$ is differentiable at $g(\mathbf{x})$. Then $f \circ g$ is differentiable at $\mathbf{x}$, with $$ \boxed{D(f \circ g)_\mathbf{x} \ = \ Df_{g(\mathbf{x})}Dg_\mathbf{x}} $$ Hence in particular, for each $i \in \{1,\ldots,n\}$ and $j \in \{1,\ldots,l\}$, taking the $(i,j)$-entry of the LHS and the RHS gives: \begin{align*} &\left.\frac{\partial f_i(g(t_1,\ldots,t_l))}{\partial t_j}\right|_{(t_1,\ldots,t_l)=\mathbf{x}} \\ & \hspace{5mm} = \ \sum_{r=1}^m \left( \left.\frac{\partial f_i(y_1,\ldots,y_m)}{\partial y_r}\right|_{(y_1,\ldots,y_m)=g(\mathbf{x})} \right) \!\! \left( \left.\frac{\partial g_r(t_1,\ldots,t_l)}{\partial t_j}\right|_{(t_1,\ldots,t_l)=\mathbf{x}} \right) \end{align*}

Note how the boxed formula is the exact parallel of the one-dimensional chain rule:

$$ (f \circ g)'(x) \ = \ f'(g(x))g'(x). $$

The intuition behind this theorem is fairly straightforward: The increment $g(\mathbf{x}+\mathbf{h})-g(\mathbf{x})$ in $\mathbb{R}^m$ is approximately equal to $Dg_\mathbf{x}$ times the increment $\mathbf{h}$ in $\mathbb{R}^l$; and therefore the increment $f(g(\mathbf{x}+\mathbf{h}))-f(g(\mathbf{x}))$ in $\mathbb{R}^n$ is approximately equal to $Df_{g(\mathbf{x})}$ times the approximate increment $Dg_\mathbf{x}\mathbf{h}$ in $\mathbb{R}^m$.

In the case that $f$ is scalar-valued, expressing the formula using $\nabla$ notation gives

$$ \boxed{(\nabla(f \circ g))(\mathbf{x}) \ = \ (Dg_\mathbf{x})^T (\nabla f)(g(\mathbf{x}))} $$

(You can probably see why I find the "general version" more intuitive.) In other words, \begin{align*} & \textrm{$j$-th entry of } (\nabla(f \circ g))(\mathbf{x}) \ = \ \left.\frac{\partial f(g(t_1,\ldots,t_l))}{\partial t_j}\right|_{(t_1,\ldots,t_l)=\mathbf{x}} \ = \\ & \hspace{8mm} \ \sum_{r=1}^m \left( \left.\frac{\partial f(y_1,\ldots,y_m)}{\partial y_r}\right|_{(y_1,\ldots,y_m)=g(\mathbf{x})} \right) \!\! \left( \left.\frac{\partial g_r(t_1,\ldots,t_l)}{\partial t_j}\right|_{(t_1,\ldots,t_l)=\mathbf{x}} \right) \end{align*}

(In your case, you happen to have that $m$ and $l$ are the same number, namely what you have called $n$ in your question.)
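The boxed matrix identity $D(f \circ g)_\mathbf{x} = Df_{g(\mathbf{x})}Dg_\mathbf{x}$ can be checked numerically. Below is a small Python sketch (my own illustration, not from the answer) with hypothetical maps $g, f \colon \mathbb{R}^2 \to \mathbb{R}^2$, i.e. $l = m = n = 2$: it multiplies the two hand-computed Jacobians and compares each entry of the product against a finite-difference Jacobian of the composition.

```python
from math import sin, cos, exp

# Hypothetical maps (l = m = n = 2), chosen for illustration:
# g(t) = (t1 + t2**2, t1 * t2),  f(y) = (sin(y1), y1 * exp(y2)).
def g(t):
    return (t[0] + t[1] ** 2, t[0] * t[1])

def f(y):
    return (sin(y[0]), y[0] * exp(y[1]))

def Dg(t):  # 2x2 Jacobian; row i = gradient of the i-th coordinate function
    return ((1.0, 2 * t[1]), (t[1], t[0]))

def Df(y):
    return ((cos(y[0]), 0.0), (exp(y[1]), y[0] * exp(y[1])))

def matmul(A, B):  # 2x2 matrix product
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

t = (0.4, 0.9)
product = matmul(Df(g(t)), Dg(t))  # the claimed D(f∘g)_t

# Finite-difference Jacobian of f∘g, column by column
h = 1e-6
def fg(t):
    return f(g(t))
base = fg(t)
for j in range(2):
    tp = (t[0] + h * (j == 0), t[1] + h * (j == 1))
    col = [(fg(tp)[i] - base[i]) / h for i in range(2)]
    for i in range(2):
        assert abs(col[i] - product[i][j]) < 1e-4
```

Each $(i,j)$-entry of the product matrix is exactly the sum $\sum_r \partial_r f_i \cdot \partial_j g_r$ displayed above, so this also exercises the entrywise formula.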


Proof of the Theorem. [Throughout this proof, I work with the convenient convention that dividing a positive number by $0$ gives $\infty$; this prevents unnecessary splitting between the cases where $\mathbf{x}$ is and is not a stationary point of $g$ and $g(\mathbf{x})$ is and is not a stationary point of $f$.]

We need to show that for any $\varepsilon>0$, if $\mathbf{h} \in \mathbb{R}^l$ is sufficiently small then $$ |\,\underbrace{f(g(\mathbf{x}+\mathbf{h}))-f(g(\mathbf{x}))-Df_{g(\mathbf{x})}Dg_\mathbf{x}\mathbf{h}}_{=:\,\mathcal{E}(\mathbf{h})}\,| \ \leq \ \varepsilon|\mathbf{h}|. $$ Define $\mathrm{Rem}_{g,\mathbf{x}} \colon \mathbb{R}^l \to \mathbb{R}^m$ by $$ \hspace{-6mm} \mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h}) \ = \ g(\mathbf{x}+\mathbf{h})-g(\mathbf{x})-Dg_\mathbf{x}\mathbf{h} $$ and likewise $\mathrm{Rem}_{f,g(\mathbf{x})} \colon \mathbb{R}^m \to \mathbb{R}^n$ by $$ \hspace{7mm} \mathrm{Rem}_{f,g(\mathbf{x})}(\mathbf{v})\ = \ f(g(\mathbf{x})+\mathbf{v})-f(g(\mathbf{x}))-Df_{g(\mathbf{x})}\mathbf{v}. $$ Then $$ \mathcal{E}(\mathbf{h}) \ = \ Df_{g(\mathbf{x})}\mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h}) + \mathrm{Rem}_{f,g(\mathbf{x})}(Dg_\mathbf{x}\mathbf{h}+\mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h})) $$ for all $\mathbf{h} \in \mathbb{R}^l$. Now fix $\varepsilon>0$. Let $\delta_1>0$ be such that $$ 0<|\mathbf{h}|<\delta_1 \ \Rightarrow \ |\mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h})| \leq \frac{\varepsilon}{2\|Df_{g(\mathbf{x})}\|}|\mathbf{h}|. $$ Let $\tilde{\delta}>0$ be such that $$ |\mathbf{v}|<\tilde{\delta} \ \Rightarrow \ |\mathrm{Rem}_{f,g(\mathbf{x})}(\mathbf{v})| \leq \frac{\varepsilon}{2(\|Dg_\mathbf{x}\|+1)}|\mathbf{v}|. $$ Let $\delta_2=\frac{\tilde{\delta}}{2\|Dg_\mathbf{x}\|}$ and let $\delta_3>0$ be such that $$ |\mathbf{h}|<\delta_3 \ \Rightarrow \ |\mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h})| \leq \min(|\mathbf{h}|,\tfrac{\tilde{\delta}}{2}). $$ Then setting $\delta:=\min(\delta_1,\delta_2,\delta_3)$, we have $$ |\mathbf{h}|<\delta \ \Rightarrow \ |Df_{g(\mathbf{x})}\mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h}) + \mathrm{Rem}_{f,g(\mathbf{x})}(Dg_\mathbf{x}\mathbf{h}+\mathrm{Rem}_{g,\mathbf{x}}(\mathbf{h}))| \leq \varepsilon|\mathbf{h}| $$ as required. $\ \ \square$

Answer 3 (score 0)

Assuming that you are given the Jacobian $J$ of $g$ and the gradient $h$ of $f$, $$J = \frac{\partial g}{\partial x} \quad\implies\quad dg = J\,dx \\ h = \frac{\partial f}{\partial g} \quad\implies\quad df = h^Tdg$$ combining these two quantities yields the desired result $$\eqalign{ df &= h^Tdg \\ &= h^T(J\,dx) \\ &= (J^Th)^Tdx } \quad\implies\quad \frac{\partial f}{\partial x} = J^Th$$
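The differential identity $\partial f/\partial x = J^T h$ can be tried out concretely. Here is a short Python sketch (a hypothetical example of mine, not from the answer) for $n = 2$: it builds $J$ and $h$ from hand-computed partials, forms $J^T h$, and compares it against a finite-difference gradient of $f \circ g$.

```python
from math import sin, cos

# Hypothetical instance (n = 2) of the differential argument above:
# f(y) = y1**2 + y2,  g(x) = (sin(x1 * x2), x1 + x2).
def g(x):
    return (sin(x[0] * x[1]), x[0] + x[1])

def f(y):
    return y[0] ** 2 + y[1]

def J(x):  # Jacobian of g: row r = gradient of g_r
    return ((x[1] * cos(x[0] * x[1]), x[0] * cos(x[0] * x[1])),
            (1.0, 1.0))

def hvec(y):  # gradient of f with respect to its own inputs
    return (2 * y[0], 1.0)

x = (0.5, 1.2)
Jx, hx = J(x), hvec(g(x))

# df/dx = J^T h, entry j = sum over r of J[r][j] * h[r]
grad = tuple(sum(Jx[r][j] * hx[r] for r in range(2)) for j in range(2))

# finite-difference check of the gradient of the composition
eps = 1e-6
def comp(x):
    return f(g(x))
numeric = tuple(
    (comp((x[0] + eps * (j == 0), x[1] + eps * (j == 1))) - comp(x)) / eps
    for j in range(2)
)
for a, b in zip(grad, numeric):
    assert abs(a - b) < 1e-4
```

Note that transposing $J$ is what turns the row-indexed partials $\partial g_r/\partial x_j$ into a sum over $r$ for each fixed $j$, matching the other answers.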

Answer 4 (score 5)

Everyone is giving the answer with proofs, but let me explain WHY the answer is what it is!

For a differentiable map $h: \mathbb{R}^n \to \mathbb{R}^m$, I think of $Dh(x)=\nabla h(x)$ as a linear map that takes a vector $v$ (with its base at $x$) to the vector $Dh(x)(v) \in \mathbb{R}^m$ based at $h(x)$. This value is in fact equal to the directional derivative of $h$ at $x$ in the direction $v$. When $Dh(x)$ is represented as a matrix, $Dh(x)v$ is a matrix multiplication.

Now, you have the composition $\mathbb{R}^n \to \mathbb{R}^n \to \mathbb{R}$. Fix an $x \in \mathbb{R}^n$ that gets mapped via $x \to g(x) \to (f\circ g)(x)$. If you take a vector $v \in \mathbb{R}^n$ based at $x$, then the derivative of $g$ at $x$ sends it to the vector $w:=Dg(x)v \in \mathbb{R}^n$, a vector based at $y:=g(x)$. Now, what does $f$ do to it? It sends it to $Df(y)w$; therefore, the initial vector $v$ ends up as $$ Df(g(x))\cdot(Dg(x)v) \ ,$$ read: the derivative of $f$ at $g(x)$ applied to the vector $Dg(x)v$. Because $Dg(x)$ and $Df(g(x))$ are linear maps, this is the composition (i.e. matrix multiplication) of the two, applied to $v$.