Consider the following function:
$$ f(T) = \| T^{T}TB - C\|^2_2 $$
where $T$, $B$, and $C$ are all complex matrices. Let $T = X + iY$. I wish to compute $\nabla f$, i.e. $\dfrac{\partial f}{\partial T}$. According to the Matrix Cookbook, the generalized complex gradient for a complex matrix $T$ is $$\dfrac{\partial f}{\partial T} = \dfrac{\partial f}{\partial \operatorname{Re}(T)} + i\dfrac{\partial f}{\partial \operatorname{Im}(T)} \equiv \dfrac{\partial f}{\partial X} + i\dfrac{\partial f}{\partial Y}$$
Expanding $f$ in terms of $X$ and $Y$,
$$ f(X+iY)= \| (X + iY)^T(X + iY)B - C\|^2_2 = \| (X^TX -Y^TY+i(X^TY + Y^TX))B - C\|^2_2$$
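This expansion can be sanity-checked numerically; here is a throwaway NumPy sketch (random square matrices, sizes and seed arbitrary):

```python
import numpy as np

# Check (X + iY)^T (X + iY) = X^T X - Y^T Y + i (X^T Y + Y^T X)
rng = np.random.default_rng(0)
n = 4
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))
T = X + 1j * Y

lhs = T.T @ T
rhs = X.T @ X - Y.T @ Y + 1j * (X.T @ Y + Y.T @ X)
print(np.allclose(lhs, rhs))  # True
```

Note that the transpose (not the conjugate transpose) is used here, matching the $T^T$ in the objective.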
The Matrix Cookbook also gives that
$$\frac{\partial g(U)}{\partial X} = \frac{Tr((\frac{\partial g(U)}{\partial U})^T \partial U)}{\partial X} + \frac{Tr((\frac{\partial g(U)}{\partial U^*})^T \partial U^*)}{\partial X}$$
My confusion is with the notation: $\partial U$ appears as a factor inside the trace, while $\partial X$ sits in the denominator outside it -- I cannot seem to understand what this is supposed to mean. Is it equivalent to $\frac{Tr((\frac{\partial g(U)}{\partial U})^T)\,\partial U}{\partial X}$?
If so, $U = (X + iY)^T(X + iY)B - C$ and $g(U) = \| U\|^2_2$. Then, $\dfrac{\partial g(U)}{\partial U} = 2U$, $\dfrac{\partial U}{\partial X}= B^T(2X + i(Y + Y^T))$, $\dfrac{\partial g(U)}{\partial U^*} = 2U^T$, and $\dfrac{\partial U^*}{\partial X}= B^T(2X - i(Y + Y^T))$.
Therefore, $\dfrac{\partial f}{\partial X} = Tr(2((X + iY)^T(X + iY)B - C))(B^T(2X + i(Y + Y^T))) + Tr(2((X + iY)^T(X + iY)B - C)^T)(B^T(2X - i(Y + Y^T)))$.
Similarly for $Y$, $\dfrac{\partial g(U)}{\partial U} = 2U$, $\dfrac{\partial U}{\partial Y}= B^T(-2Y + i(X + X^T))$, $\dfrac{\partial g(U)}{\partial U^*} = 2U^T$, and $\dfrac{\partial U^*}{\partial Y}= B^T(-2Y - i(X + X^T))$.
Therefore, $\dfrac{\partial f}{\partial Y} = Tr(2((X + iY)^T(X + iY)B - C))(B^T(-2Y + i(X + X^T))) + Tr(2((X + iY)^T(X + iY)B - C)^T)(B^T(-2Y - i(X + X^T)))$.
Would the gradient computed with this method be applicable in a gradient descent algorithm (such as ISTA) for $T$?
The Wirtinger gradient of the objective function wrt $T$ is
$$\eqalign{
\def\p{\partial}
\def\grad#1#2{\frac{\p #1}{\p #2}}
\grad fT = G \;=\; TB\,(T^TTB-C)^H \,+\, T\,(T^TTB-C)^*B^T \\
}$$
which can be used to calculate the gradients wrt the real and imaginary parts of $T$. In what follows, the colon denotes the Frobenius inner product, i.e. $A:B = \operatorname{Tr}(A^TB) = \sum_{ij} A_{ij}B_{ij}$.
$$\eqalign{
\def\qiq{\quad\implies\quad}
\def\p{\partial}
\def\grad#1#2{\frac{\p #1}{\p #2}}
T &= X+iY \\
dT &= dX+i\,dY \qiq dT^* = dX-i\,dY \\
df &= (G:dT) \;+\; (G^*:dT^*) \\
&= (G:dX + iG:dY) \;+\; (G^*:dX - iG^*:dY) \\
&= (G+G^*):dX \quad+\quad i\,(G-G^*):dY \\
&= 2\,{\sf Real}(G):dX \quad-\quad 2\,{\sf Imag}(G):dY \\
\grad fX &= 2\,{\sf Real}(G), \qquad \grad fY = -2\,{\sf Imag}(G) \\
}$$
This result is consistent with the canonical Wirtinger definition, i.e.
$$\eqalign{
\def\p{\partial}
\def\grad#1#2{\frac{\p #1}{\p #2}}
\grad fT &= \frac12 \left( \grad fX - i\,\grad fY \right) \;=\; {\sf Real}(G) + i\,{\sf Imag}(G) \;\equiv\; G \\
}$$
While it's nice to know how to do such calculations correctly, for gradient descent iterations you really only need $G$. Since $f$ is real-valued, the steepest-descent direction is $-\,\partial f/\partial T^* = -G^*$ (note the conjugate), so the update is
$$T_{k+1} \,=\, T_k - \lambda_k\, G_k^*$$
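The closed-form gradient can be sanity-checked against central finite differences of $f$ in the real and imaginary parts. A throwaway NumPy sketch (matrix sizes, the seed, the finite-difference step `h`, and the step size `lam` are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))
B = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
C = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

def f(X, Y):
    T = X + 1j * Y
    M = T.T @ T @ B - C
    return np.sum(np.abs(M) ** 2)  # squared Frobenius norm

# closed-form Wirtinger gradient  G = TB(T^T T B - C)^H + T(T^T T B - C)^* B^T
T = X + 1j * Y
M = T.T @ T @ B - C
G = T @ B @ M.conj().T + T @ M.conj() @ B.T

# central finite differences for df/dX and df/dY, entry by entry
h = 1e-6
fX = np.empty((n, n))
fY = np.empty((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = h
        fX[i, j] = (f(X + E, Y) - f(X - E, Y)) / (2 * h)
        fY[i, j] = (f(X, Y + E) - f(X, Y - E)) / (2 * h)

print(np.allclose(fX, 2 * G.real, atol=1e-4))   # True
print(np.allclose(fY, -2 * G.imag, atol=1e-4))  # True

# one small step: since f is real-valued, -conj(G) is a guaranteed
# descent direction (stepping along -G alone is not guaranteed to decrease f)
lam = 1e-4
T2 = T - lam * G.conj()
print(f(T2.real, T2.imag) < f(X, Y))  # True
```

The two `allclose` checks confirm $\partial f/\partial X = 2\,{\sf Real}(G)$ and $\partial f/\partial Y = -2\,{\sf Imag}(G)$; the last line checks that the conjugate-gradient step actually decreases the objective.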