Derivative of Dot Product as matrix multiplication


I've come across this definition when looking into how to differentiate parameter vectors in statistics.

Given $ \pmb{x}^{T} \pmb{x}$ $$\frac{\partial (\pmb{x}^{T} \pmb{x}) }{\partial \pmb{x}}=2\ \pmb{x}^{T}$$

The proofs I've seen apply the product rule, holding one factor constant in each term; a non-bold $x$ below marks the factor being held constant. (Source: http://www.cs.huji.ac.il/~csip/tirgul3_derivatives.pdf)

$$\frac{\partial (\pmb{x}^{T} \pmb{x}) }{\partial \pmb{x}}= \frac{\partial ({x}^{T} \pmb{x}) }{\partial \pmb{x}} + \frac{\partial (\pmb{x}^{T} x) }{\partial \pmb{x}}= \pmb{x}^{T} + \pmb{x}^{T} = 2\pmb{x}^{T} $$

I understand how we get $\pmb{x}^{T}$ from the first term, $\frac{\partial ({x}^{T} \pmb{x}) }{\partial \pmb{x}}$.

What I'm not seeing is how we get $\frac{\partial (\pmb{x}^{T} x) }{\partial \pmb{x}} = \frac{\partial (\pmb{x}^{T}) }{\partial \pmb{x}}x=\pmb{x}^{T}$

  • Why is it that $\pmb{x}$ becomes $\pmb{x}^{T}$ when we differentiate $\pmb{x}^{T}$ with respect to $\pmb{x}$ ?
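Before looking at the answers, the identity itself is easy to verify numerically. The sketch below (my addition, not from the question) checks by central finite differences that the gradient of $f(\pmb{x}) = \pmb{x}^{T}\pmb{x}$ is $2\pmb{x}$ (written as a column here; the $2\pmb{x}^{T}$ in the question is the same object under a row-vector layout convention):

```python
import numpy as np

# Numerical sanity check (not a proof): the gradient of f(x) = x^T x
# should be 2x, matching the identity quoted above.
rng = np.random.default_rng(0)
x = rng.standard_normal(5)

f = lambda v: v @ v          # x^T x as a scalar
eps = 1e-6

# Central finite differences, one coordinate at a time.
grad = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    grad[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(grad, 2 * x, atol=1e-5))  # True
```

Central differences are exact (up to rounding) for a quadratic, which is why such a loose tolerance suffices.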

There are 3 answers below.

Answer 1:

As $x^Tx$ is a scalar, it equals its own transpose. Coloring the varying factor blue, $$\color{blue}x^Tx=(\color{blue}x^Tx)^T=x^T\color{blue}x.$$ So differentiating $\pmb{x}^{T}x$ with respect to its first factor is the same as differentiating $x^{T}\pmb{x}$ with respect to its second factor, and both terms give $\pmb{x}^{T}$.
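A numerical illustration of this point (my addition, under the assumption that "held constant" means treating that factor as a fixed vector $a$): the gradients of $a^{T}\pmb{x}$ and $\pmb{x}^{T}a$ with respect to $\pmb{x}$ are identical, because the two scalars are equal.

```python
import numpy as np

# Because a^T x is a scalar, x^T a = a^T x, so both "half" terms of
# the product rule have the same gradient (a, as a vector).
rng = np.random.default_rng(1)
a = rng.standard_normal(4)
x = rng.standard_normal(4)

def num_grad(f, x, eps=1e-6):
    # Central finite-difference gradient, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

g1 = num_grad(lambda v: a @ v, x)   # d(a^T x)/dx
g2 = num_grad(lambda v: v @ a, x)   # d(x^T a)/dx
print(np.allclose(g1, a, atol=1e-5) and np.allclose(g2, a, atol=1e-5))  # True
```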

Answer 2:

Consider the trace/Frobenius product (denoted by a colon) $$A:B = {\rm Tr}(A^TB)$$

As long as the matrices $(A,B)$ have the same number of rows and columns as each other, they can have any shape: tall-and-thin, square, or short-and-fat. They can also be row or column vectors, in which case one recovers the ordinary dot product $$a:b = {\rm Tr}(a^Tb) = a\cdot b$$

The properties of the trace allow the terms of a Frobenius product to be rearranged in many ways $$\eqalign{ A:BC &= B^TA:C = AC^T:B \\ A:B &= A^T:B^T \\ A:B &= B:A \\ }$$

For this particular problem, simply take the differential of the Frobenius product and then use the commutative property $$\eqalign{ \phi &= x:x \\ d\phi &= x:dx + dx:x \\&= 2x:dx \\ \frac{\partial \phi}{\partial x} &= 2x \\ }$$

(Note that this result remains valid when $x$ is a matrix instead of a vector.)

Due to a different choice of layout convention, some authors write this gradient as $$\frac{\partial \phi}{\partial x} = 2x^T$$
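The rearrangement rules and the matrix-valued case can both be spot-checked numerically. The sketch below (my addition) uses arbitrary but compatible shapes; `frob` implements $A:B = {\rm Tr}(A^{T}B)$:

```python
import numpy as np

# Spot-check of the Frobenius-product identities above, A:B = Tr(A^T B).
# Shapes are arbitrary as long as the products are compatible:
# A is 3x2, B is 3x4, C is 4x2, so BC is 3x2 like A.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

frob = lambda P, Q: np.trace(P.T @ Q)

print(np.isclose(frob(A, B @ C), frob(B.T @ A, C)))  # A:BC = B^T A : C -> True
print(np.isclose(frob(A, B @ C), frob(A @ C.T, B)))  # A:BC = A C^T : B -> True

# The gradient result for a matrix argument: phi = X:X has d(phi)/dX = 2X,
# checked entrywise by central finite differences.
X = rng.standard_normal((3, 2))
phi = lambda M: frob(M, M)
eps = 1e-6
G = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        G[i, j] = (phi(X + E) - phi(X - E)) / (2 * eps)
print(np.allclose(G, 2 * X, atol=1e-5))  # True
```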

Answer 3:

Let $f(\mathbf{x})=\mathbf{x}^\top\mathbf{x}$. If we write $\mathbf{x}=\mathbf{x}_0+\epsilon \mathbf{y}$ in the neighborhood of a point $\mathbf{x}_0$, then we may write \begin{align} f(\mathbf{x}) &=\mathbf{x}^\top\mathbf{x}\\ &=(\mathbf{x}_0+\epsilon \mathbf{y})^\top (\mathbf{x}_0+\epsilon \mathbf{y})\\ &=\mathbf{x}_0^\top \mathbf{x}_0+(2\mathbf{x}_0^\top)(\epsilon\mathbf{y})+\epsilon^2 \|\mathbf{y}\|^2\\ &=f(\mathbf{x}_0)+(2\mathbf{x}_0^\top) (\mathbf{x}-\mathbf{x}_0)+O(\epsilon^2), \end{align} where the cross terms combine because $\mathbf{x}_0^\top\mathbf{y}=\mathbf{y}^\top\mathbf{x}_0$ is a scalar. In other words, the function $L(\mathbf{x})=f(\mathbf{x}_0)+(2\mathbf{x}_0^\top)(\mathbf{x}-\mathbf{x}_0)$ is the best linear approximation to $f(\mathbf{x})$ in the neighborhood of $\mathbf{x}_0$. Since this holds at every point $\mathbf{x}_0$, we conclude that $\dfrac{df}{d\mathbf{x}}=2\mathbf{x}^\top$.
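The linearization above can also be checked numerically (my addition): for $f(\mathbf{x})=\mathbf{x}^\top\mathbf{x}$, the error of $L(\mathbf{x})=f(\mathbf{x}_0)+2\mathbf{x}_0^\top(\mathbf{x}-\mathbf{x}_0)$ is exactly $\epsilon^2\|\mathbf{y}\|^2$, so it shrinks quadratically in $\epsilon$, confirming that $L$ captures the whole linear part.

```python
import numpy as np

# The error of the linear approximation L(x) = f(x0) + 2 x0^T (x - x0)
# against f(x) = x^T x is exactly eps^2 * ||y||^2 for x = x0 + eps*y.
rng = np.random.default_rng(3)
x0 = rng.standard_normal(5)
y = rng.standard_normal(5)
f = lambda v: v @ v

for eps in (1e-1, 1e-2, 1e-3):
    x = x0 + eps * y
    L = f(x0) + 2 * x0 @ (x - x0)
    err = abs(f(x) - L)
    print(np.isclose(err, eps**2 * (y @ y)))  # True for each eps
```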