Non-elementwise Matrix Derivatives

146 Views Asked by At

Let A,B,C,D,X be matrices. I'd like to perform a Gradient Descent minimization to the loss functin $$ tr[(AXBX^TC-D)^T(AXBX^TC-D)] $$

My question is, how to take the gradient efficiently w.r.t. $B$? I have came up with the following: $$ \frac {\partial} {B_{ij}} tr[(AXBX^TC-D)^T(AXBX^TC-D)] =$$ $$ tr[(\frac {\partial} {B_{ij}}(AXBX^TC-D)^T)(AXBX^TC-D)]+$$ $$tr[(AXBX^TC-D)^T\frac {\partial} {B_{ij}}(AXBX^TC-D)] =$$ recalling that $trX^T=trX$: $$ 2tr[(\frac {\partial} {B_{ij}}(AXBX^TC-D)^T)(AXBX^TC-D)]=$$ $$ 2tr[((AX(\frac {\partial} {B_{ij}}B)X^TC)^T)(AXBX^TC-D)]=$$ $$ 2tr[((AXJ^{ij}X^TC)^T)(AXBX^TC-D)]$$ where $J^{ij}$ is all zero except that the $i,j$ element is 1. The problem is that I need to have an expression for whole $B$ rather than only $B_{ij}$ so I won't need to do all this multiplication as the amount of $B$'s elements when I update $B$ by Gradient Descent.

I have read many material (including the matrix cookbook) and still very confused.

Thanks for your help!!!

2

There are 2 best solutions below

2
On BEST ANSWER

You can always absorb $X$ into $A$ and $X^\top$ into $C$, so that $X$ is eliminated in your objective function. Now, using the cyclic property of the trace of matrix product, you can rewrite the objective function as \begin{align*} f(B)&=\operatorname{tr}\left((ABC-D)^\top(ABC-D)\right)\\ &=\operatorname{tr}\left(B^\top A^\top ABCC^\top\right) - 2\operatorname{tr}\left(B^\top A^\top DC^\top\right) +\operatorname{tr}\left(D^\top D\right)\tag{1} \end{align*} In general, if $M$ is some constant matrix, the derivative of $\operatorname{tr}(B^\top M)$ with respect to $B$ is $M$. Using the product rule to handle the first summand in $(1)$, we see that the derivative of $f(B)$ with respect to $B$ is $2(A^\top ABCC^\top - A^\top DC^\top)$ for the new $A$ and $C$. So, with the original $A$ and $C$, the derivative should be $$2(X^\top A^\top AXBX^\top CC^\top X - X^\top A^\top DC^\top X).$$

0
On

The function is of the form $f(B) = \operatorname{tr}(ABC-D)^T(ABC-D))$. Just multiplying gives \begin{eqnarray} (ABC-D)^T(ABC-D)) &=& (C^TB^TA^T-D^T)(ABC-D) \\ &=& C^TB^TA^T ABC-D^T ABC-C^TB^TA^TD+D^TD \end{eqnarray} Letting $Q(B) = C^TB^TA^T ABC-D^T ABC-C^TB^TA^TD+D^TD$, we see that $$Q(B+\Delta) = Q(B) + C^T\Delta^TA^T ABC+ C^TB^TA^T A\Delta C-D^T A\Delta C-C^T\Delta^TA^TD + C^T\Delta^TA^T A\Delta C $$ and so $DQ(B)(\Delta) = C^T\Delta^TA^T ABC+ C^TB^TA^T A\Delta C-D^T A\Delta C-C^T\Delta^TA^TD$.

Since $\operatorname{tr}$ is linear, it follows that $Df(B)(\Delta) = \operatorname{tr}(DQ(B)(\Delta))$.

This gives \begin{eqnarray} Df(B)(\Delta) &=&\operatorname{tr} ( C^T\Delta^TA^T ABC+ C^TB^TA^T A\Delta C-D^T A\Delta C-C^T\Delta^TA^TD ) \\ &=& \operatorname{tr} ( C^T\Delta^TA^T ABC) + \operatorname{tr}(C^TB^TA^T A\Delta C) - \operatorname{tr}(D^T A\Delta C)-\operatorname{tr}(C^T\Delta^TA^TD ) \\ &=& \operatorname{tr} ( C^TB^TA^TA \Delta C) + \operatorname{tr}(C C^TB^TA^T A\Delta ) - \operatorname{tr}(CD^T A\Delta)-\operatorname{tr}(D^TA \Delta C ) \\ &=& \operatorname{tr} (C C^TB^TA^TA \Delta ) + \operatorname{tr}(C C^TB^TA^T A\Delta ) - \operatorname{tr}(CD^T A\Delta)-\operatorname{tr}(C D^TA \Delta ) \\ &=& 2(\operatorname{tr} (C C^TB^TA^TA \Delta ) -\operatorname{tr}(C D^TA \Delta ) ) \\ &=& 2\operatorname{tr} (C C^TB^TA^TA \Delta -C D^TA \Delta ) \\ &=& \operatorname{tr} (2C (C^TB^TA^T - D^T ) A \Delta ) \\ &=& \operatorname{tr} (2( A^T (ABC -D ) C^T )^T \Delta ) \\ \end{eqnarray}

So, in terms of the original problem, we have $Df(B)(\Delta) = \operatorname{tr} (2( X^T A^T (AXBX^TC -D ) C^T X )^T \Delta )$.

Now we note that if $\phi(\Delta) = \operatorname{tr} (F^T \Delta) $, then $\phi(E_{ij}) = [F]_{ij}$ (where $E_{ij}$ is the matrix of zeros except with a one at the $i,j$ position).

It follows that $\frac{\partial f(B)}{\partial B_{ij}} = Df(B)(E_{ij}) = [ 2( X^T A^T (AXBX^TC -D ) C^T X ]_{ij}$.

Note 1: It is not quite correct to say that $Df(B) = 2 X^T A^T (AXBX^TC -D ) C^T X$, as $Df(B): \mathbb{R}^{n \times n} \to \mathbb{R}$, whereas $(2 X^T A^T (AXBX^TC -D ) C^T X): \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$.

However, with the inner product $\langle X, Y \rangle_F = \operatorname{tr}(X^T Y)$, we see that $Df(B)(\Delta) = \langle 2 X^T A^T (AXBX^TC -D ) C^T X, \Delta \rangle_F$, so with this inner product, we can write $\nabla f(B) = 2 X^T A^T (AXBX^TC -D ) C^T X$.

Note 2: To compute the derivative/gradient, you need only compute the matrix $2( X^T A^T (AXBX^TC -D ) C^T X$ once.