Assume I have a probability density function $p(x)$ for $x \in \mathbb{R}^d$ and a transformation function $T:\mathbb{R}^d \rightarrow \mathbb{R}^d$
$$T(x)=x+\epsilon v(x)$$
where $\epsilon$ is a small scalar increment and $v(x)$ an evaluation of a smooth vector field $v$ at $x$. Assume further that I have $\nabla_x=[\frac{d}{d x_1},\frac{d}{d x_2},...,\frac{d}{d x_d}]^T$ and $\nabla_\epsilon=\frac{d}{d \epsilon}$.
I am trying to understand a derivation in which the following identity occurs:
$$\nabla_\epsilon p(T(x))=[\nabla_xp(T(x))]^T\cdot\nabla_\epsilon T(x)$$
How did the authors arrive here? The dimensions check out: $\nabla_\epsilon p(T(x))$ is a scalar, and the RHS is an inner product of two $d$-dimensional vectors which also yields a scalar. I assume the chain rule of differentiation was used here to obtain $\nabla_\epsilon T(x)$, but why do we obtain the first RHS vector? Where does the $\nabla_x$ come from?
I would appreciate any advice or help in understanding this.
The identity can indeed be derived using the chain rule for the composition of a scalar field $f$ and a vector field $g$. Using the $\nabla_x$ notation, this case of the chain rule is $$ (f\circ g)_i(x) = (\nabla_x f)(g(x)) \cdot g_i(x). $$ Here the subscript $i$ denotes the partial derivative with respect to the $i$-th variable, so $g_i(x)$ is the vector of partial derivatives of the components of $g$. The factor $(\nabla_x f)(g(x))$ is the gradient of $f$ evaluated at the point $g(x)$, and the right side is the dot product of these two vectors.
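Written out in components, the same rule may be easier to recognise. This is only a sketch in notation I am introducing here: $g^{(j)}$ stands for the $j$-th component of $g$ (to avoid a clash with the derivative subscript), and $\partial_j f$ for the partial derivative of $f$ with respect to its $j$-th argument.
$$ (f\circ g)_i(x) = \sum_{j=1}^{d} (\partial_j f)(g(x)) \, \frac{\partial g^{(j)}}{\partial x_i}(x) $$
The sum over $j$ is exactly the dot product between the gradient of $f$ at $g(x)$ and the vector of derivatives of the components of $g$.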
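In your case, the scalar field is $p$ and the vector field is $T(x,\epsilon)$. The way it is used in your identity, $T$ is really a mapping from $\mathbb{R}^{d+1}$ to $\mathbb{R}^d$. The $\nabla_{\epsilon}$ operator is just the partial derivative with respect to variable $d+1$. The left side of your identity is $(p\circ T)_{d+1}(x,\epsilon)$ and the chain rule applied to this is $$ (p\circ T)_{d+1}(x,\epsilon) = (\nabla_x p)(T(x,\epsilon)) \cdot (\nabla_{\epsilon}T)(x,\epsilon). $$ So the $\nabla_x$ in your identity is the gradient of $p$ with respect to its own argument, evaluated at the point $T(x)$; it is not the gradient of the map $x \mapsto p(T(x))$. Since $T(x,\epsilon)=x+\epsilon v(x)$, the second factor is simply $(\nabla_{\epsilon}T)(x,\epsilon)=v(x)$.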
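If it helps to see the identity hold numerically, here is a minimal sketch that compares a central finite difference in $\epsilon$ against the dot product on the right side. The standard Gaussian density for `p`, the particular field `v`, and the test point are all illustrative assumptions of mine, not taken from the derivation you are reading.

```python
import numpy as np

def p(x):
    # Standard Gaussian density on R^d (an illustrative choice of p)
    d = x.shape[0]
    return np.exp(-0.5 * (x @ x)) / (2 * np.pi) ** (d / 2)

def grad_p(y):
    # Gradient of p with respect to its own argument: grad p(y) = -y * p(y)
    return -y * p(y)

def v(x):
    # An arbitrary smooth vector field R^d -> R^d (hypothetical example)
    return np.sin(x) + 0.5 * x[::-1]

def T(x, eps):
    # The transformation T(x) = x + eps * v(x)
    return x + eps * v(x)

def lhs(x, eps, h=1e-6):
    # d/d(eps) of p(T(x, eps)), approximated by a central finite difference
    return (p(T(x, eps + h)) - p(T(x, eps - h))) / (2 * h)

def rhs(x, eps):
    # Gradient of p evaluated at T(x, eps), dotted with dT/d(eps) = v(x)
    return grad_p(T(x, eps)) @ v(x)

x = np.array([0.3, -1.2, 0.7])
eps = 0.05
print("finite difference:", lhs(x, eps))
print("chain rule:       ", rhs(x, eps))
```

The two printed numbers should agree up to the finite-difference error. Note that `grad_p` is evaluated at `T(x, eps)`, not at `x`, which is the point the notation $\nabla_x p(T(x))$ tends to obscure.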