Gradient of product of matrices


I am reading Duchi's notes$^\color{red}{\star}$ and trying to understand why

$$\nabla_A (A B) = B^\top, \qquad \nabla_A \mbox{tr} (A B) = B^\top$$

and why they are the same. Can someone please explain how to derive the gradient of a matrix product, and what the appropriate dimensions of this gradient are?

The trace is a scalar, so it makes sense to me that its gradient has the dimensions of $B^\top$, which must match the dimensions of $A$. But I can't see how to obtain the gradient of a product of matrices, or what its dimensions should be.


$\color{red}{\star}$ John Duchi, Properties of the Trace and Matrix Derivatives


There are 2 best solutions below


Suppose $f(A)=\operatorname{tr} (AB)$, then $f(A+H)-f(A) = \operatorname{tr} (HB)$, so we have $Df(A)(H) = \operatorname{tr} (HB)$. (Not surprisingly, since trace is linear.)

In a Hilbert space, the gradient of a functional is an element $\nabla f(A)$ such that $Df(A)(H) = \langle \nabla f(A), H \rangle$ for all $H$.

Since $\langle X, Y \rangle = \operatorname{tr} (X^T Y)$ and $\operatorname{tr} (HB) = \operatorname{tr} ((B^T)^T H) = \langle B^T, H \rangle$, we see that $\nabla f(A) = B^T$.
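As a component-wise cross-check of the same result (a sketch, assuming $A$ is $m \times n$ and $B$ is $n \times m$ so that $AB$ is square; the size labels $m$, $n$ are introduced here for illustration):

$$\operatorname{tr} (AB) = \sum_{k=1}^{m} \sum_{l=1}^{n} A_{kl} B_{lk}, \qquad \frac{\partial \operatorname{tr} (AB)}{\partial A_{ij}} = B_{ji} = (B^T)_{ij}$$

So the array of partial derivatives has the same $m \times n$ shape as $A$, and its entries are those of $B^T$.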

This is entirely analogous to a function $g : \mathbb{R}^n \to \mathbb{R}$. The derivative is usually written as a row vector while the gradient is a column vector.
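A quick numerical sanity check can make the shape statement concrete. The sketch below assumes NumPy; the helper `numerical_gradient` and the sizes `m`, `n` are hypothetical names chosen for illustration. It estimates each partial derivative of $\operatorname{tr}(AB)$ by central differences and compares the result with $B^T$.

```python
import numpy as np

def numerical_gradient(f, A, eps=1e-6):
    """Central-difference estimate of the partials d f / d A[i, j]."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        E = np.zeros_like(A)
        E[idx] = eps                      # perturb a single entry of A
        grad[idx] = (f(A + E) - f(A - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n = 3, 4                               # arbitrary illustrative sizes
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))           # AB is m x m, so the trace is defined

grad = numerical_gradient(lambda X: np.trace(X @ B), A)

print(grad.shape == A.shape)              # gradient has the shape of A
print(np.allclose(grad, B.T, atol=1e-6))  # and matches B^T numerically
```

Because $f$ is linear in $A$, the finite-difference estimate agrees with $B^T$ up to rounding error regardless of the step size chosen.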

Addendum:

Let $f(A) = \operatorname{tr} (A B A^T C)$. Then we have $f(A+H)-f(A) = \operatorname{tr} (H B A^T C)+\operatorname{tr} (A B H^T C)+\operatorname{tr} (H B H^T C)$. The last term is of order $O(\|H\|^2)$, so we see that $Df(A)(H) = \operatorname{tr} (H B A^T C)+\operatorname{tr} (A B H^T C) $.

The relevant properties of the trace are (i) transpose invariance, $\operatorname{tr} X = \operatorname{tr} X^T$, and (ii) cyclic (shift) invariance, $\operatorname{tr} (X_1 \cdots X_n) = \operatorname{tr} (X_2 \cdots X_n X_1)$.

Applying these gives \begin{eqnarray} Df(A)(H) &=& \operatorname{tr} ((C^T A B^T)^T H)+\operatorname{tr} ((CAB)^TH) \\ &=& \langle C^T A B^T + CAB, H \rangle \end{eqnarray} from which we get the gradient to be $\nabla f(A) = C^T A B^T + CAB$.
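The formula for the addendum can be checked numerically as well. This is a minimal sketch, assuming NumPy and arbitrary illustrative sizes (`m`, `n`, the seed, and the step `t` are not from the answer). It compares a central-difference directional derivative of $f$ along a random $H$ with the inner product $\langle C^T A B^T + CAB, H \rangle$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4                               # arbitrary illustrative sizes
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m))

def f(X):
    # f(X) = tr(X B X^T C); the product is m x m, so the trace is defined
    return np.trace(X @ B @ X.T @ C)

# Gradient claimed in the derivation above
claimed = C.T @ A @ B.T + C @ A @ B

# Directional derivative Df(A)(H), estimated by central differences
H = rng.standard_normal((m, n))
t = 1e-6
directional = (f(A + t * H) - f(A - t * H)) / (2 * t)

# Frobenius inner product <claimed, H> = tr(claimed^T H)
inner = np.trace(claimed.T @ H)

print(np.isclose(directional, inner, rtol=1e-6, atol=1e-6))
```

Since $f$ is quadratic in $A$, the $O(\|H\|^2)$ term cancels in the central difference, so the two numbers should agree up to rounding error.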

5
On

The gradient of a matrix with respect to a matrix is a fourth-order tensor.

It can be calculated from the differential

$$\eqalign{ C &= AB \cr dC &= dA\,B = {\mathcal H}B^T:dA \cr \frac{\partial C}{\partial A} &= {\mathcal H}B^T \cr }$$

where ${\mathcal H}$ is a 4th order isotropic tensor whose components can be expressed in terms of Kronecker deltas

$$\eqalign{ {\mathcal H}_{ijkl} &= \delta_{ik}\,\delta_{jl} \cr }$$

The colon is used to represent the double-contraction product, while juxtaposition represents a single-contraction product. In terms of components

$$\eqalign{ M &= {\mathcal H}:X &\implies M_{ij} = {\mathcal H}_{ijkl}\,X_{kl} \cr {\mathcal P} &= {\mathcal H} X &\implies {\mathcal P}_{ijkm} = {\mathcal H}_{ijkl}\,X_{lm} \cr }$$

The trace is just a double-contraction with the identity matrix, i.e.

$${\rm tr}(X) = I:X$$

Therefore

$$\eqalign{ {\rm tr}\bigg(\frac{\partial C}{\partial A}\bigg) &= \frac{\partial\,{\rm tr}(C)}{\partial A} = I:{\mathcal H}B^T = B^T \cr }$$
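The index contractions above can be spelled out with `np.einsum`. The following is a sketch under stated assumptions: NumPy is used, $A$ is $m \times n$ and $B$ is $n \times m$ so that the trace of $C = AB$ exists, and the names `Htensor`, `G`, and `dA` are hypothetical. It builds ${\mathcal H}$, forms ${\mathcal H}B^T$, and checks both the differential and the contraction with the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4                               # arbitrary illustrative sizes
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))           # C = AB is m x m

I = np.eye(m)

# H_{ijkl} = delta_{ik} delta_{jl}
Htensor = np.einsum('ik,jl->ijkl', I, I)

# Single contraction: (H B^T)_{ijkq} = H_{ijkl} (B^T)_{lq}
G = np.einsum('ijkl,lq->ijkq', Htensor, B.T)
print(G.shape)                            # (m, m, m, n): a 4th-order tensor

# Double contraction with a perturbation dA reproduces dC = dA B
dA = rng.standard_normal((m, n))
dC = np.einsum('ijkq,kq->ij', G, dA)
print(np.allclose(dC, dA @ B))

# Contracting the first two indices with the identity gives B^T
trace_grad = np.einsum('ij,ijkq->kq', I, G)
print(np.allclose(trace_grad, B.T))
```

The first two indices of `G` label the entries of $C$ and the last two label the entries of $A$, which is why contracting the first pair with $I$ leaves an array with the shape of $A$.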