In my studies of applied mathematics, specifically optimization and applied linear algebra, I have come across the following expression which I need help differentiating:
$ z(B,C) = \lVert Y-B \phi(CX) \rVert_F ^2 = \text{trace} \left( (Y-B \phi(CX))(Y-B \phi(CX))^T \right) $
where $ Y \in \mathbb{R}^{m\times s} $ and $ X\in \mathbb{R}^{n\times s} $ are two constant real matrices, $ B \in \mathbb{R} ^{m\times k} $ and $ C\in\mathbb{R}^{k \times n} $ are the two variable matrices of the function $ z(B,C) $ defined above, $ \lVert \bullet \rVert_F $ is the Frobenius norm, and all matrix dimensions are fixed.
The function $ \phi : \mathbb{R} \to \mathbb{R} $ is a nonlinear function defined for real numbers as follows:
$ \phi(x) = \left\{ \begin{array}{ll} 0 & x \leq 0 \\ x & x > 0 \\ \end{array} \right. $
and we extend its definition to matrices element-wise, that is $ (\phi(A))_{i,j} = \phi((A)_{i,j}) $.
I would like to find the matrix derivative of the function $ z $, as stated above, with respect to $ C $, that is $ \frac{\partial z}{\partial C}$.
Differentiating with respect to $ B $ is easy enough, since $ \phi(CX) $ does not depend on $ B $ and the standard formulas apply. My problem is differentiating with respect to $ C $: it sits inside the argument of a nonlinear function, and I do not know how to differentiate such an expression with respect to a matrix.
I was hoping someone could please come to the rescue and help me differentiate the function $ z $ with respect to $ C $. I thank all helpers.
Let's use a convention where uppercase Latin letters represent matrices, lowercase Latin letters represent vectors, and Greek letters represent scalars.
The function you've denoted by $\phi$ is the ReLU function, $r(\alpha)$, whose derivative (for $\alpha \neq 0$) is the Heaviside step function $$h(\alpha) = \frac{dr(\alpha)}{d\alpha} \implies dr = h\,d\alpha$$ ReLU is not differentiable at $\alpha = 0$, so a value such as $h(0)=0$ has to be chosen there by convention. Applying these scalar functions element-wise to a matrix argument $A=CX$ produces matrix results, which we'll denote as $$\eqalign{ R &= r(A) \cr H &= h(A) \implies dR = H\odot dA = H\odot(dC\,X) \cr }$$ where $\odot$ is the elementwise/Hadamard product.
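As a sanity check on the differential $dR = H\odot(dC\,X)$, here is a minimal NumPy sketch. The function names (`relu`, `heaviside`, `relu_differential`) are my own and not taken from any library, and the value $h(0)=0$ is an assumed convention chosen to match $\phi(x)=0$ for $x\le 0$.

```python
import numpy as np

def relu(a):
    """Element-wise ReLU, r(a) = max(a, 0)."""
    return np.maximum(a, 0.0)

def heaviside(a):
    """Element-wise step function with the assumed convention h(0) = 0."""
    return (a > 0).astype(float)

def relu_differential(C, X, dC):
    """First-order change of R = relu(C @ X) for a perturbation dC of C."""
    H = heaviside(C @ X)
    return H * (dC @ X)  # '*' is the Hadamard product
```

For a small perturbation `dC`, the difference `relu((C + dC) @ X) - relu(C @ X)` should agree with `relu_differential(C, X, dC)` as long as no entry of $CX$ changes sign.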
Define a new matrix variable $$M=BR-Y \implies dM = B\,dR + dB\,R$$ Write the function in terms of this new variable, then find its differential. $$\eqalign{ \lambda &= \|M\|^2_F = M:M \cr d\lambda &= 2M:dM \cr &= 2M:B\,dR + 2M:dB\,R \cr &= 2B^TM:dR + 2MR^T:dB \cr &= 2B^TM:H\odot(dC\,X) + 2MR^T:dB \cr &= 2(B^TM)\odot H:(dC\,X) + 2MR^T:dB \cr &= 2((B^TM)\odot H)X^T:dC + 2MR^T:dB \cr }$$ Setting $dB=0$ yields the gradient wrt $C$ $$\frac{\partial\lambda}{\partial C} = 2((B^TM)\odot H)X^T$$ And setting $dC=0$ yields the gradient wrt $B$ $$\frac{\partial\lambda}{\partial B} = 2MR^T$$
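If you want to convince yourself numerically, the two gradient formulas can be compared against central finite differences. This is only a sketch under my own assumptions: the helper names (`loss`, `gradients`, `numerical_gradient`) are hypothetical, $h(0)=0$ is assumed, and the check is only meaningful away from the kinks where an entry of $CX$ is exactly zero.

```python
import numpy as np

def loss(B, C, X, Y):
    """lambda = ||B relu(C X) - Y||_F^2."""
    M = B @ np.maximum(C @ X, 0.0) - Y
    return np.sum(M * M)

def gradients(B, C, X, Y):
    """Closed-form gradients of the loss with respect to B and C."""
    A = C @ X
    R = np.maximum(A, 0.0)          # R = r(A)
    H = (A > 0).astype(float)       # H = h(A), assuming h(0) = 0
    M = B @ R - Y
    grad_B = 2.0 * M @ R.T
    grad_C = 2.0 * ((B.T @ M) * H) @ X.T
    return grad_B, grad_C

def numerical_gradient(f, W, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at W."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        G[idx] = (f(W + E) - f(W - E)) / (2.0 * eps)
    return G
```

With random `B`, `C`, `X`, `Y` of compatible shapes, `numerical_gradient(lambda W: loss(B, W, X, Y), C)` should match `grad_C` up to finite-difference error, and likewise for `grad_B`. Both gradients come out with the same shape as the variable they belong to.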
NB: Depending on your preferred layout convention, you may need to transpose these results.
Also, the colon notation used above (called the Frobenius product) is just a convenient way of writing the trace function, i.e. $\,\,A:B={\rm tr}(A^TB)$.
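The cyclic property of the trace leads to several ways to rearrange the terms in a Frobenius product. For example, all of the following expressions are equivalent $$\eqalign{ A:BC &= A^T:(BC)^T \cr &= BC:A \cr &= AC^T:B \cr &= B^TA:C \cr }$$ In addition, the Hadamard and Frobenius products commute with each other, i.e. $A:(B\odot C) = (A\odot B):C$, which is what allows $H$ to be moved to the other side of the colon in the derivation above.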
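These rearrangement rules are easy to spot-check numerically. The sketch below assumes NumPy; `frob` and `frobenius_rearrangements` are names I made up for illustration.

```python
import numpy as np

def frob(A, B):
    """Frobenius product A:B = tr(A^T B), computed as a sum of element-wise products."""
    return np.sum(A * B)

def frobenius_rearrangements(A, B, C):
    """Evaluate the equivalent forms of A:BC listed above."""
    return [
        frob(A, B @ C),          # A : BC
        frob(A.T, (B @ C).T),    # A^T : (BC)^T
        frob(B @ C, A),          # BC : A
        frob(A @ C.T, B),        # A C^T : B
        frob(B.T @ A, C),        # B^T A : C
    ]
```

For any `A`, `B`, `C` with shapes such that `B @ C` has the same shape as `A`, all five returned values should coincide up to rounding.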