Let $A, B, C, D, X$ be matrices. I would like to minimize the loss function $$ \operatorname{tr}\left[(AXBX^TC-D)^T(AXBX^TC-D)\right] $$ by gradient descent.
My question is: how do I take the gradient efficiently w.r.t. $B$? I have come up with the following: $$ \frac {\partial} {\partial B_{ij}} \operatorname{tr}\left[(AXBX^TC-D)^T(AXBX^TC-D)\right] =$$ $$ \operatorname{tr}\left[\left(\frac {\partial} {\partial B_{ij}}(AXBX^TC-D)^T\right)(AXBX^TC-D)\right]+$$ $$\operatorname{tr}\left[(AXBX^TC-D)^T\frac {\partial} {\partial B_{ij}}(AXBX^TC-D)\right] =$$ recalling that $\operatorname{tr} M^T=\operatorname{tr} M$: $$ 2\operatorname{tr}\left[\left(\frac {\partial} {\partial B_{ij}}(AXBX^TC-D)^T\right)(AXBX^TC-D)\right]=$$ $$ 2\operatorname{tr}\left[\left(AX\left(\frac {\partial B} {\partial B_{ij}}\right)X^TC\right)^T(AXBX^TC-D)\right]=$$ $$ 2\operatorname{tr}\left[(AXJ^{ij}X^TC)^T(AXBX^TC-D)\right],$$ where $J^{ij}$ is all zeros except for a 1 in the $(i,j)$ entry. The problem is that I need an expression for the gradient with respect to the whole of $B$ at once, so that I don't have to repeat this computation once per element of $B$ every time I update $B$ by gradient descent.
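For concreteness, here is a minimal NumPy sketch of this elementwise approach (the shapes are arbitrary choices of mine, just for illustration), using the fact that $\operatorname{tr}(M^TR)=\sum_{a,b}M_{ab}R_{ab}$; the double loop over the entries of $B$ is exactly the cost I want to avoid:

```python
import numpy as np

# Arbitrary compatible shapes, chosen only for illustration.
m, n, p, q = 4, 5, 3, 6
A = np.random.randn(m, n)
X = np.random.randn(n, p)
B = np.random.randn(p, p)
C = np.random.randn(n, q)
D = np.random.randn(m, q)

AX, XtC = A @ X, X.T @ C
R = AX @ B @ XtC - D               # residual A X B X^T C - D
G = np.zeros_like(B)
for i in range(p):                 # one matrix product per entry of B...
    for j in range(p):
        J = np.zeros_like(B)       # J^{ij}: single 1 in the (i,j) entry
        J[i, j] = 1.0
        # G_ij = 2 tr[(A X J^{ij} X^T C)^T (A X B X^T C - D)]
        G[i, j] = 2.0 * np.sum((AX @ J @ XtC) * R)
```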
I have read a lot of material (including the Matrix Cookbook) and am still very confused.
Thanks for your help!!!
You can always absorb $X$ into $A$ and $X^\top$ into $C$, so that $X$ is eliminated from your objective function. Then, using the cyclic property of the trace, you can rewrite the objective function as \begin{align*} f(B)&=\operatorname{tr}\left((ABC-D)^\top(ABC-D)\right)\\ &=\operatorname{tr}\left(B^\top A^\top ABCC^\top\right) - 2\operatorname{tr}\left(B^\top A^\top DC^\top\right) +\operatorname{tr}\left(D^\top D\right).\tag{1} \end{align*}

In general, if $M$ is a constant matrix, the derivative of $\operatorname{tr}(B^\top M)$ with respect to $B$ is $M$. Applying the product rule to the first summand in $(1)$ (the two occurrences of $B$ contribute equal terms, by symmetry), the derivative of $f(B)$ with respect to $B$ is $2(A^\top ABCC^\top - A^\top DC^\top)$ for the new $A$ and $C$. So, with the original $A$ and $C$, the derivative is $$2(X^\top A^\top AXBX^\top CC^\top X - X^\top A^\top DC^\top X).$$
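Not part of the derivation, but here is a quick numerical sanity check (with shapes and data of my own choosing) that compares this closed-form gradient against a central finite-difference approximation. Note that the expression factors as $2X^\top A^\top(AXBX^\top C - D)C^\top X$, which is cheaper to evaluate:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, q = 4, 5, 3, 6
A = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, p))
C = rng.standard_normal((n, q))
D = rng.standard_normal((m, q))

def loss(B):
    R = A @ X @ B @ X.T @ C - D
    return np.sum(R * R)               # tr(R^T R) = ||R||_F^2

# Closed-form gradient: 2 X^T A^T (A X B X^T C - D) C^T X,
# i.e. the expression above with the residual factored out.
R = A @ X @ B @ X.T @ C - D
G = 2.0 * (X.T @ A.T @ R @ C.T @ X)

# Central finite differences, entry by entry.
eps, G_fd = 1e-6, np.zeros_like(B)
for i in range(p):
    for j in range(p):
        E = np.zeros_like(B)
        E[i, j] = eps
        G_fd[i, j] = (loss(B + E) - loss(B - E)) / (2 * eps)

print(np.max(np.abs(G - G_fd)))        # should be ~1e-7 or smaller
```

A gradient-descent step is then simply $B \leftarrow B - \eta G$ for some step size $\eta$.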