Differentiate vector transpose using rules


I am referring to Tom Minka's Old and New Matrix Algebra Useful for Statistics. I don't have the book by Magnus and Neudecker, so I can't check the details of the theory.

Regarding rules (6): $d(XY) = (dX)Y + X(dY)$ and (12): $dX^*=(dX)^*$, it is not clear to me how to apply them. The notation I use is numerator layout, i.e. $\dfrac{dx}{dx} = I$.

Question 1.

$f(x)=x^Tx$ , $\dfrac{df}{dx}=2x^T$

However, if I use $\dfrac{df}{dx}= x^T\dfrac{dx}{dx} + \dfrac{dx^T}{dx}x$: first, is $\dfrac{dx^T}{dx}$ equal to $1^T$? Second, according to rule (12), is $\dfrac{dx^T}{dx} = \left(\dfrac{dx}{dx}\right)^T = I^T = I$?
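As a sanity check (not part of the original question), the claimed gradient $\dfrac{df}{dx}=2x^T$ can be verified numerically with central finite differences. The sketch below uses NumPy; all variable names are illustrative.

```python
import numpy as np

def f(x):
    # f(x) = x^T x
    return x @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Numerator-layout gradient claimed in the text: df/dx = 2 x^T.
# As a 1-D NumPy array it is simply 2*x.
grad_claimed = 2 * x

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])

print(np.allclose(grad_claimed, grad_fd, atol=1e-6))
```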

Question 2.

$f(x) = x^TAx$

$\dfrac{df}{dx}=x^T\dfrac{dAx}{dx}+ \dfrac{dx^T}{dx}(Ax) = x^TA + ???$

$???$ is supposed to be $x^TA^T$; however, whether $\dfrac{dx^T}{dx}$ equals $1^T$ or $I$, neither choice gives the expected result.
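The expected answer $x^T(A+A^T)$ can also be confirmed numerically before worrying about which rule produces it. A sketch with a random matrix $A$ (names are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def f(x):
    # f(x) = x^T A x
    return x @ A @ x

# Expected numerator-layout gradient: x^T (A + A^T).
grad_expected = x @ (A + A.T)

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(grad_expected, grad_fd, atol=1e-5))
```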

  • Let's use numerator-layout notation. First note that $\frac{dx}{dx}=I$, but $\frac{dx^T}{dx}=\begin{bmatrix}\begin{pmatrix}1&0&\cdots&0\end{pmatrix},\begin{pmatrix}0&1&0&\cdots&0\end{pmatrix},\dots,\begin{pmatrix}0&0&\cdots&0&1\end{pmatrix}\end{bmatrix}$ is a tensor, technically $1 \times n \times n$. In denominator-layout fashion, $\frac{dx^T}{dx}=\left[\begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix},\begin{pmatrix}0\\1\\0\\\vdots\\0\end{pmatrix},\dots,\begin{pmatrix}0\\\vdots\\0\\1\end{pmatrix}\right]$, an $n \times 1 \times n$ tensor. It is possible to imagine it as a 3D matrix with the entries stacked behind one another rather than listed like this.

  • The inner product is symmetric, e.g. $x^Ty=y^Tx=\langle x, y\rangle$. We have the following four scenarios directly applying the derivative to $x^Tx$ and $x^TAx$:

$$\begin{matrix}&\text{denominator layout}&\text{numerator layout}\\ \frac{d}{dx}x^Tx&\frac{dx}{dx}x+\frac{dx}{dx}x=2x&x^T\frac{dx}{dx}+x^T\frac{dx}{dx}=2x^T\\ \frac{d}{dx}x^TAx&\frac{dAx}{dx}x+\frac{dx}{dx}Ax=(A^T+A)x&x^T\frac{dAx}{dx}+x^TA^T\frac{dx}{dx}=x^T(A+A^T)\end{matrix}$$

Therefore the rule is $\frac{d}{dx}\langle x, y\rangle=\frac{dx}{dx}y+\frac{dy}{dx}x$ in denominator layout and $\frac{d}{dx}\langle x, y\rangle=x^T\frac{dy}{dx}+y^T\frac{dx}{dx}$ in numerator layout. Multiplying through by $dx$ suggests that $d\langle x, y\rangle=(dx)y+(dy)x$ in denominator layout but $d\langle x, y\rangle=x^T(dy)+y^T(dx)$ in numerator layout. Therefore it is doubtful that you can treat $d\langle x, y\rangle$ as a scalar and freely take the transpose.

Conclusion

You should directly use the product rule.


These rules pertain to differentials, not to gradients.

Let's use them properly, starting with your second example function:
$$\eqalign{ f_2 &= x^TAx \\ df_2 &= dx^T\,Ax+x^TA\,dx \\ &= (Ax)^Tdx+(A^Tx)^Tdx \\ &= (Ax+A^Tx)^Tdx \\ \frac{\partial f_2}{\partial x} &= Ax+A^Tx \\ }$$
Setting $A=I$ turns this into your first function. Therefore
$$\eqalign{ \frac{\partial f_1}{\partial x} &= Ix+I^Tx \;=\; 2x \\\\ }$$
There are no corresponding rules for gradients, because a gradient operation changes a vector into a matrix, and matrix multiplication is not commutative. Trying to apply the rules to gradients produces nonsense, as you have discovered.
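The differential identity $df_2 = (Ax+A^Tx)^Tdx$ can be checked numerically: for a small perturbation $dx$, the actual change $f_2(x+dx)-f_2(x)$ should agree with the predicted first-order term up to $O(\|dx\|^2)$. A sketch (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def f2(x):
    # f2(x) = x^T A x
    return x @ A @ x

dx = 1e-6 * rng.standard_normal(n)   # a small perturbation

# First-order change predicted by the differential: df2 = (Ax + A^T x)^T dx
df_predicted = (A @ x + A.T @ x) @ dx

# Actual change in the function value.
df_actual = f2(x + dx) - f2(x)

# These agree up to the second-order term dx^T A dx, here roughly 1e-12.
print(abs(df_predicted - df_actual))
```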