Confusion with Matrix calculus derivative computation.


The text I am reading has the following

$$E=(I+\nabla I\frac{\partial W}{\partial p}\Delta p -T)^2$$ where $\nabla I$ is a $1$ by $2$ row vector, $\frac{\partial W}{\partial p}$ is a $2$ by $3$ matrix and $\Delta p$ is a $3$ by $1$ column vector. $I$ and $T$ are scalars. We seek the gradient $\frac{\partial E}{\partial \Delta p_k}$, but I do not understand their solution. I will present my attempt, and then their solution.

In index notation, we have $E=(I+\nabla I_i\frac{\partial W^i}{\partial p_k}\Delta p^k -T)^2$, where summation is implied by repeated indices. Then the derivative is

$$\frac{\partial E}{\partial \Delta p_m}=2(I+\nabla I_i\frac{\partial W^i}{\partial p_k}\Delta p^k -T)\nabla I_n\frac{\partial W^n}{\partial p_m}$$ or returning to the matrix notation

$$\frac{\partial E}{\partial \Delta p}=2(I+\nabla I\frac{\partial W}{\partial p}\Delta p -T)\nabla I\frac{\partial W}{\partial p}\tag{1}\label{1}$$

But their solution is

$$\frac{\partial E}{\partial \Delta p}=2(\nabla I\frac{\partial W}{\partial p})^T(I+\nabla I\frac{\partial W}{\partial p}\Delta p -T)\tag{2}\label{2}$$

Why is their solution correct, and where did I make the mistake? My solution is equation \eqref{1}, but theirs is \eqref{2}. I am confused: why is $\nabla I\frac{\partial W}{\partial p}$ transposed in theirs?

Best answer

It looks like the text you are reading uses the so-called denominator layout for matrix calculus notation: given two column vectors $\boldsymbol{x}\in\mathbb{R}^m$ and $\boldsymbol{y}\in\mathbb{R}^n$, of size $m\times1$ and $n\times1$ respectively, we write the derivative $\displaystyle\dfrac{\partial \boldsymbol{y}}{\partial\boldsymbol{x}}$ as an $m\times n$ matrix. In other words, the layout follows $\boldsymbol y^{\boldsymbol\top}$ and $\boldsymbol{x}$.

In your case $E$ is a scalar, so $n=1$ and $m=3$, and thus the derivative of the scalar with respect to the vector $\Delta p$ has to be a column vector of size $3\times1$.
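The shape argument can be checked numerically. The sketch below is only an illustration: the values of `grad_I`, `dW_dp`, `delta_p`, `I_val` and `T_val` are random or made-up placeholders with the shapes given in the question, and the names are mine, not the text's. It evaluates both equation (1) and equation (2) and compares them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values; only the shapes come from the question.
grad_I = rng.standard_normal((1, 2))   # row vector, 1 x 2
dW_dp = rng.standard_normal((2, 3))    # Jacobian of the warp, 2 x 3
delta_p = rng.standard_normal((3, 1))  # column vector, 3 x 1
I_val, T_val = 0.7, 0.2                # scalars I and T (made up)

J = grad_I @ dW_dp                                # 1 x 3 row vector
residual = (I_val + J @ delta_p - T_val).item()   # scalar inside the square

grad_eq1 = 2.0 * residual * J     # equation (1): a 1 x 3 row
grad_eq2 = 2.0 * J.T * residual   # equation (2): a 3 x 1 column

print(grad_eq1.shape, grad_eq2.shape)      # (1, 3) (3, 1)
print(np.allclose(grad_eq1.T, grad_eq2))   # True
```

Both expressions hold the same three numbers; they differ only in whether the result is arranged as a row or as a column.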

More details are available, for example, on the Wikipedia page for Matrix Calculus.


Moreover, using the table of scalar-by-vector identities given there, one can work out the dimensions of each term arising from the chain rule explicitly. According to that table,

Assume $\boldsymbol{b}\in \mathbb R^m$ and $\boldsymbol x \in \mathbb{R}^n$ are column vectors, and the matrix $\boldsymbol A$ is in $\mathbb{R}^{m\times n}$. If $\boldsymbol A$ and $\boldsymbol b$ do not depend on $\boldsymbol x$, then the following holds (in denominator layout): $$ \dfrac{\partial\left(\boldsymbol b^{\boldsymbol\top}\boldsymbol A \boldsymbol x\right)}{\partial \boldsymbol x} = \boldsymbol A^{\boldsymbol \top} \boldsymbol b = \left(\boldsymbol b^{\boldsymbol\top}\boldsymbol A\right)^{\boldsymbol\top}$$
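If you want to convince yourself of this identity, a finite-difference check is one option. This is a minimal sketch with random test data; the helper `numerical_gradient` and the sizes `m = 4`, `n = 3` are my own choices for illustration.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at the
    column vector x, returned as a column of the same shape as x."""
    grad = np.zeros_like(x)
    for k in range(x.shape[0]):
        step = np.zeros_like(x)
        step[k, 0] = eps
        grad[k, 0] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
m, n = 4, 3                        # arbitrary sizes for the test
b = rng.standard_normal((m, 1))    # column vector in R^m
A = rng.standard_normal((m, n))    # m x n matrix
x = rng.standard_normal((n, 1))    # column vector in R^n

def f(v):
    # The scalar b^T A v
    return (b.T @ A @ v).item()

analytic = A.T @ b                 # n x 1 column, as in the identity
numeric = numerical_gradient(f, x)
identity_holds = np.allclose(analytic, numeric, atol=1e-6)
print(identity_holds)
```

Arranging the numerical gradient as a column of the same shape as `x` is exactly the denominator-layout convention.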

Using this identity, you can easily see that $$ \frac{\partial }{\partial \Delta p} \left( \nabla I\frac{\partial W}{\partial p}\Delta p\right) = \left( \nabla I\frac{\partial W}{\partial p}\right)^{\boldsymbol\top} $$
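To connect this with equation \eqref{2}, one more application of the chain rule is needed. Writing $r$ for the scalar inside the square (a symbol introduced here for convenience, not taken from your text), we have

$$ r = I+\nabla I\frac{\partial W}{\partial p}\Delta p -T, \qquad E = r^2, $$

$$ \frac{\partial E}{\partial \Delta p} = 2r\,\frac{\partial r}{\partial \Delta p} = 2\left( \nabla I\frac{\partial W}{\partial p}\right)^{\boldsymbol\top}\left(I+\nabla I\frac{\partial W}{\partial p}\Delta p -T\right), $$

which is \eqref{2}. Your equation \eqref{1} contains the same three entries arranged as a $1\times3$ row, which is what the numerator layout would give. So, as far as I can tell, your computation is not wrong; it just follows a different layout convention from the one used in the text.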