Order of multiplication in chain rule?


I am trying to figure out the order in which the factors are multiplied when applying the chain rule.

For example, suppose I have a function $f:\mathbb{R}^n\rightarrow \mathbb{R}$.

I take the gradient of its composite with $g:\mathbb{R}^n\rightarrow \mathbb{R}^n$, $g(y)=x_0+Py$, so that $h(y)=f(g(y))=f(x_0+Py)$. Is $\nabla h=P\nabla f$, or is it $\nabla h = \nabla f P$? (Obviously $\nabla g=P$.)

Wikipedia appears to be inconsistent on this (image: chain rule from Wikipedia). In the first example $\nabla f(a)$ comes first, and in the second example $\nabla f(a)$ comes second. What is the correct order?


There are 2 best solutions below

BEST ANSWER

In the following, I am assuming that $\nabla f$ denotes the gradient of a functional (a function taking values in $\mathbb{R}$). (This is also a lot more long-winded than originally intended!)

In $\mathbb{R}^n$ the difference between the derivative and the gradient of a functional is often blurred, which leads to confusion.

The distinction is blurred because, to some extent, the two can be viewed as different conventions (a row vector vs. a column vector). The distinction becomes more important in non-Hilbert spaces.

In terms of derivatives, we have $D h(y) = D f(g(y)) D g(y) = Df(x_0+Py) P$. (Note that $Dg(y) = P$ since $g$ is affine with linear part $P$.)
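
As a quick shape check (a sketch, assuming the usual convention that $Df(x)$ is written as a $1\times n$ row and $P$ as an $n\times n$ matrix), only one order of the product is defined when $n>1$:

$$\underbrace{Dh(y)}_{1\times n} = \underbrace{Df(x_0+Py)}_{1\times n}\;\underbrace{P}_{n\times n}, \qquad \text{whereas } P\,Df(x_0+Py) \text{ would be } (n\times n)(1\times n).$$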

Note that for any linear functional $\phi: \mathbb{R}^n \to \mathbb{R}$, there is a unique element $w \in \mathbb{R}^n$ such that $\phi(x) = w^T x$.

The derivative $Df(x)$ is a linear functional, so it can be written as above for a unique element of $\mathbb{R}^n$; we use the notation $\nabla f(x)$ (the gradient) to denote this unique element.

That is, $Df(x)\eta = \nabla f(x)^T \eta$ for all $\eta \in \mathbb{R}^n$.

Now we want to compute $\nabla h(y)$.

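Note that $D h(y) \eta = Df(x_0+Py) P \eta = \nabla f(x_0+Py)^T P\eta$ for every $\eta$. Since $w^T P \eta = (P^T w)^T \eta$ for any $w$, this can be written as $Dh(y)\eta = (P^T \nabla f(x_0+Py))^T \eta$ (note the transpose on $P$). By uniqueness of the representing element, $\nabla h(y) = P^T \nabla f(x_0+Py)$.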
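
The formula can also be sanity-checked numerically. The following is only a sketch under stated assumptions: the test function `f`, its hand-computed gradient `grad_f`, the helper `numerical_gradient`, and the random `x0`, `P`, `y` are all illustrative choices, not anything taken from the question. It compares a central finite-difference gradient of $h$ with both candidate orderings.

```python
import numpy as np

def f(x):
    # Illustrative functional f: R^n -> R (an arbitrary smooth choice)
    return np.sum(np.sin(x)) + 0.5 * (x @ x)

def grad_f(x):
    # Gradient of the illustrative f above, computed by hand
    return np.cos(x) + x

def numerical_gradient(func, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    grad = np.zeros_like(y)
    for i in range(y.size):
        step = np.zeros_like(y)
        step[i] = eps
        grad[i] = (func(y + step) - func(y - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
x0 = rng.standard_normal(n)
P = rng.standard_normal((n, n))  # generic, non-symmetric matrix
y = rng.standard_normal(n)

def h(y):
    # Composite h(y) = f(x0 + P y)
    return f(x0 + P @ y)

numeric = numerical_gradient(h, y)
with_transpose = P.T @ grad_f(x0 + P @ y)    # candidate P^T grad f
without_transpose = P @ grad_f(x0 + P @ y)   # candidate P grad f

print("matches P^T grad f:", np.allclose(numeric, with_transpose, atol=1e-5))
print("matches P grad f:  ", np.allclose(numeric, without_transpose, atol=1e-5))
```

For a generic non-symmetric `P`, only the transposed form should agree with the finite-difference gradient. The two candidates coincide when $P$ is symmetric, which is one reason the transpose is easy to overlook.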

Personally, I use the gradient representation when dealing with things from a geometric perspective, but generally use the derivative when computing.


It seems that $g$ is given by $$g:\quad {\mathbb R}^n\to{\mathbb R}^n,\qquad y\mapsto g(y)=a+P.y$$ for a certain $a\in{\mathbb R}^n$ and a linear $P$. Then $h$ is defined by $$h(y):=f\bigl(g(y)\bigr)\ .$$

The chain rule says that $$dh(y)=df\bigl(g(y)\bigr)\circ dg(y)\ .\tag{1}$$

Now $$dg(y)=P\quad\forall\>y,\qquad{\rm and}\qquad df(x).X=\langle \nabla f(x),X\rangle\quad\forall X\ .$$

Plugging this into $(1)$ we obtain $$\langle\nabla h(y),Y\rangle=dh(y).Y=df\bigl(g(y)\bigr).\bigl(dg(y).Y\bigr)=\langle\nabla f\bigl(g(y)\bigr),PY\rangle=\langle P^\top\nabla f\bigl(g(y)\bigr),Y\rangle\ .$$

Since this is true for all $Y\in T_y$ we conclude that $$\nabla h(y)=P^\top\nabla f\bigl(g(y)\bigr)\ .$$
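
As a worked instance of this result (an illustrative choice of $f$, not one taken from the question), take $f(x)=\tfrac12\langle x,x\rangle$, so that $\nabla f(x)=x$. The formula gives $$\nabla h(y)=P^\top\bigl(a+Py\bigr)=P^\top a+P^\top P\,y\ ,$$ which agrees with differentiating the expanded form $$h(y)=\tfrac12\langle a,a\rangle+\langle P^\top a,y\rangle+\tfrac12\langle y,P^\top P\,y\rangle$$ term by term.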