I am trying to figure out the order of the factors when applying the chain rule.
For example, suppose I have a function $f:\mathbb{R}^n\rightarrow \mathbb{R}$.
I take the gradient of its composite with $g:\mathbb{R}^m\rightarrow \mathbb{R}^n$, where $g(y)=x_0+Py$ and so $h(y)=f(g(y))=f(x_0+Py)$. Is $\nabla h=P\nabla f$, or is it $\nabla h = \nabla f P$? (Obviously $\nabla g=P$.)
On Wikipedia the order appears to be inconsistent:
the first example puts $\nabla f(a)$ first, while the second example puts $\nabla f(a)$ second. What is the correct order?
In the following, I am assuming that $\nabla f$ denotes the gradient of a functional (a function taking values in $\mathbb{R}$). (This is also a lot more long-winded than originally intended!)
In $\mathbb{R}^n$ the difference between the derivative and the gradient of a functional is often blurred, which leads to confusion.
The distinction is blurred because, to some extent, the two can be viewed as different conventions (a row vector vs. a column vector). The distinction becomes more important in non-Hilbert spaces.
In terms of derivatives, we have $D h(y) = D f(g(y)) D g(y) = Df(x_0+Py) P$. (Note that $Dg(y) = P$ since $g$ is affine with linear part $P$.)
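As a quick sanity check on shapes, here is the same product with dimensions written out. This assumes (notation not fixed above) that $P$ is an $n \times m$ matrix, so that $y \in \mathbb{R}^m$, and that derivatives of functionals are written as row vectors:

$$\underbrace{Dh(y)}_{1\times m} \;=\; \underbrace{Df(x_0+Py)}_{1\times n}\;\underbrace{P}_{n\times m}.$$

With these conventions the factors can only be multiplied in this order when $n \neq m$.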
Note that for any linear functional $\phi: \mathbb{R}^n \to \mathbb{R}$, there is a unique element $w \in \mathbb{R}^n$ such that $\phi(x) = w^T x$.
The derivative $Df(x)$ is a linear functional, so it can be written as above for a unique element of $\mathbb{R}^n$. We use the notation $\nabla f(x)$ (the gradient) to denote this unique element.
That is, $Df(x)v = \nabla f(x)^T v$ for every $v \in \mathbb{R}^n$.
Now we want to compute $\nabla h(y)$.
Note that $D h(y) \eta = Df(x_0+Py) P \eta = \nabla f(x_0+Py)^T P\eta$, so we can write $Dh(y)\eta = (P^T \nabla f(x_0+Py))^T \eta$ (note the transpose on $P$). Since the gradient is the unique element representing the derivative, we have $\nabla h(y) = P^T \nabla f(x_0+Py)$.
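The formula $\nabla h(y) = P^T \nabla f(x_0+Py)$ can be checked numerically. The following is only a minimal sketch: the test function `f`, the matrix `P`, the point `x0`, and the helper `numerical_gradient` are all hypothetical choices made for illustration, with $n = 3$ and $m = 2$, and the comparison is against central finite differences.

```python
import math

# Hypothetical test function f: R^3 -> R and its hand-computed gradient.
def f(x):
    return math.sin(x[0]) + x[1] ** 2 + x[0] * x[2]

def grad_f(x):
    return [math.cos(x[0]) + x[2], 2.0 * x[1], x[0]]

# Hypothetical data: P is 3 x 2 (n = 3, m = 2), x0 is a point in R^3.
P = [[1.0, 2.0],
     [0.5, -1.0],
     [3.0, 0.25]]
x0 = [0.1, -0.2, 0.3]

def g(y):
    # g(y) = x0 + P y
    return [x0[i] + sum(P[i][j] * y[j] for j in range(len(y)))
            for i in range(len(x0))]

def h(y):
    return f(g(y))

def grad_h(y):
    # grad h(y) = P^T grad f(x0 + P y); entry j is sum_i P[i][j] * gf[i].
    gf = grad_f(g(y))
    return [sum(P[i][j] * gf[i] for i in range(len(P)))
            for j in range(len(y))]

def numerical_gradient(func, y, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for j in range(len(y)):
        up = list(y)
        up[j] += eps
        down = list(y)
        down[j] -= eps
        grad.append((func(up) - func(down)) / (2 * eps))
    return grad

y = [0.7, -0.4]
analytic = grad_h(y)
numeric = numerical_gradient(h, y)
max_error = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_error < 1e-6)
```

Because $n \neq m$ here, the product $P \nabla f$ is not even defined (a $3 \times 2$ matrix cannot multiply a vector in $\mathbb{R}^3$), which is one practical way to remember that the transpose is needed.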
Personally, I use the gradient representation when dealing with things from a geometric perspective, but generally use the derivative when computing.