What is the derivative of $\mathbf{a}^T\mathbf{X}^2\mathbf{b}$ wrt the matrix $\mathbf{X}$?


Given vectors $\mathbf{a}$ and $\mathbf{b}$, I am looking for the derivative of the following scalar function

$$y(\mathbf{X}) = \mathbf{a}^T\mathbf{X}^2\mathbf{b}$$

with respect to the matrix $\mathbf{X}$. I couldn't find a direct answer on Wikipedia.

There are 4 answers below.

BEST ANSWER

From the Wikipedia page on matrix calculus, we have the result $$ \frac{\partial \operatorname{Tr}(AX^n)}{\partial X} = \sum_{i=0}^{n-1}X^iAX^{n-i-1}. $$ Since $y$ is a scalar, we can rewrite $$ y(X) = a^TX^2b = \operatorname{Tr}(a^TX^2 b) = \operatorname{Tr}([ba^T]X^2), $$ so plugging $n=2$ and $A = ba^T$ into the Wikipedia result yields $$ \frac{\partial y}{\partial X} = X^0AX^1 + X^1AX^0 = AX + XA = ba^TX + Xba^T. $$


To compare this with the directional-derivative answer: the same table of trace identities yields $$ \frac {dy}{dX} = ba^TX + Xba^T \implies\\ dy = \operatorname{Tr}([ba^TX + Xba^T]\,dX) = \operatorname{Tr}(ba^TX\,dX) + \operatorname{Tr}(Xba^T\,dX) = \\ \operatorname{Tr}(a^TX\,dX\,b) + \operatorname{Tr}(a^T\,dX\,Xb) = a^TX(dX)b + a^T(dX)X\,b = \\ a^T[X(dX) + (dX)X]b, $$ which matches the other result.
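A quick numerical sanity check of the identity $dy=\operatorname{Tr}([ba^TX+Xba^T]\,dX)$, sketched in NumPy (the variable names and sizes are illustrative; since $y$ is quadratic in $X$, a central difference is exact up to rounding):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a, b = rng.standard_normal(n), rng.standard_normal(n)
X = rng.standard_normal((n, n))

def y(M):
    return a @ M @ M @ b            # y(X) = a^T X^2 b

G = np.outer(b, a) @ X + X @ np.outer(b, a)   # b a^T X + X b a^T

# Compare Tr(G dX) with a central difference in a random direction dX
dX = rng.standard_normal((n, n))
h = 1e-6
fd = (y(X + h * dX) - y(X - h * dX)) / (2 * h)
print(np.isclose(np.trace(G @ dX), fd))
```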

ANSWER

If $Z$ is any matrix then $$y(X+hZ)-y(X)=a^T(X+hZ)^2b -a^TX^2b=a^T(h(XZ+ZX)+h^2Z^2)b$$ so that $$\lim_{h\to0}\frac{y(X+hZ)-y(X)}{h}=a^T(XZ+ZX)b $$
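This limit can be checked numerically with a small NumPy sketch (names and a finite step $h$ standing in for the limit are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
a, b = rng.standard_normal(n), rng.standard_normal(n)
X, Z = rng.standard_normal((n, n)), rng.standard_normal((n, n))

def y(M):
    return a @ M @ M @ b            # y(X) = a^T X^2 b

directional = a @ (X @ Z + Z @ X) @ b   # claimed limit a^T (XZ + ZX) b
h = 1e-7
approx = (y(X + h * Z) - y(X)) / h      # forward difference quotient
print(abs(approx - directional) < 1e-4)
```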

ANSWER

Use a colon to denote the trace/Frobenius product, i.e. $$\eqalign{ Y:Z &= \operatorname{Tr}(Y^TZ) }$$ Write the function $(f)$ and calculate its differential $(df)$ and gradient $(G)$. $$\eqalign{ f &= \operatorname{Tr}\left(a^TX^2b\right) \\ &= a:XXb \\ &= ab^T:XX \\ \\ df &= ab^T:(dX\,X + X\,dX) \\ &= (ab^TX^T+X^Tab^T):dX \\ &= G:dX \\ \\ \frac{\partial f}{\partial X} &= G \\ &= ab^TX^T+X^Tab^T \\ }$$
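The gradient $G=ab^TX^T+X^Tab^T$ is the componentwise derivative $\partial f/\partial X_{ij}$ (note it is the transpose of the best answer's result, which uses the $dy=\operatorname{Tr}(G\,dX)$ convention). A minimal NumPy check against entrywise central differences, with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a, b = rng.standard_normal(n), rng.standard_normal(n)
X = rng.standard_normal((n, n))

def f(M):
    return a @ M @ M @ b            # f(X) = a^T X^2 b

G = np.outer(a, b) @ X.T + X.T @ np.outer(a, b)   # a b^T X^T + X^T a b^T

# Entrywise central-difference gradient (exact here, since f is quadratic in X)
num = np.zeros((n, n))
h = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        num[i, j] = (f(X + h * E) - f(X - h * E)) / (2 * h)

print(np.allclose(G, num))
```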

ANSWER

You want to compute the derivative of a map from a finite-dimensional ($\mathbb{R}$-)vector space $V$ to the real numbers. In this case, the derivative is given by the gradient, i.e., if $e_{\alpha}$ for $\alpha$ in some finite index set are the basis vectors of $V$, we have that

$$Df(X)=\{Z\mapsto\nabla f(X)\cdot Z\}\quad\text{where}$$

$$\nabla f(X)\cdot Z=\sum_{\alpha}(\nabla f(X))_{\alpha}\cdot Z_{\alpha}\quad\text{and}\quad(\nabla f(X))_{\alpha}=\frac{\partial f}{\partial e_{\alpha}}(X)\,.$$

In your case, the basis vectors are the matrices $E_{ij}$ with entries $(E_{ij})_{kl}=\delta_{ik}\delta_{jl}$, where $\delta$ is the Kronecker delta. Hence, the only complication is that our basis indices are multi-indices. All we have to do is compute the gradient of $y$. We have

$$y(X)=\sum_{k,j,i}a_{k}X_{kj}X_{ji}b_{i}\,.$$

Using that $\frac{\partial y}{\partial E_{ij}}=\frac{\partial y}{\partial X_{ij}}$, we get

$$(\nabla y(X))_{mn}=\frac{\partial y}{\partial X_{mn}}=\sum_{k,j,i}a_{k}\delta_{mk}\delta_{nj}X_{ji}b_{i}+\sum_{k,j,i}a_{k}X_{kj}\delta_{mj}\delta_{ni}b_{i}\,.$$

The first summand is only non-zero when $m=k$ and $n=j$, and the second summand is only non-zero if $m=j$ and $n=i$, which leads to

$$(\nabla y(X))_{mn}=\sum_{i}a_{m}X_{ni}b_{i}+\sum_{k}a_{k}X_{km}b_{n}=a_{m}(Xb)_{n}+(X^{T}a)_{m}b_{n}\,.$$ Realising that we can write the last term using outer products, we get that $$\nabla y(X)=a(Xb)^{T}+(X^{T}a)b^{T}\,.$$

Note that, in order to apply the gradient to a vector, i.e., to compute the derivative, you need to take the dot product: $$\nabla y(X)\cdot Z=\sum_{i,j}(\nabla y(X))_{ij}Z_{ij}\,.$$
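A NumPy sketch tying this answer back to the directional-derivative one: the dot product $\nabla y(X)\cdot Z$ should equal $a^T(XZ+ZX)b$ (all names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
a, b = rng.standard_normal(n), rng.standard_normal(n)
X, Z = rng.standard_normal((n, n)), rng.standard_normal((n, n))

# grad y(X) = a (Xb)^T + (X^T a) b^T
grad = np.outer(a, X @ b) + np.outer(X.T @ a, b)

# <grad, Z> = sum_ij grad_ij Z_ij should match a^T (XZ + ZX) b
lhs = np.sum(grad * Z)
rhs = a @ (X @ Z + Z @ X) @ b
print(np.isclose(lhs, rhs))
```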