Deriving the gradient of a multivariable function taking matrices as input?

I've been tasked with deriving the gradient for the following non-convex function and I'd like to verify whether my solution / progress is correct so far (I'm not too confident with calculus):

$$F(U, V) = \dfrac{1}{2}\sum_{(i,j)\in{\Omega}}(M_{ij}-u_iv_{j}^T)^2 + \frac{\lambda}{2}(||U||^2_F + ||V||^2_F)$$

Where $M, U, V$ are matrices, $u_i$ and $v_j$ denote the $i$th row of $U$ and the $j$th row of $V$ respectively, and $\lambda$ is a scalar regularization parameter weighting the penalty terms.

I've obtained the following so far for the partial derivative with respect to $U$:

$$\dfrac{\partial F(U,V)}{\partial U} =-\sum_{(i,j)\in{\Omega}}(M_{ij}-u_iv_{j}^T)^2 + \lambda(||U||^2_F)$$

But I'm a bit unsure about differentiating $\frac{\lambda}{2}(||U||^2_F + ||V||^2_F)$. I think what I have above is correct, since $V$ is treated as a constant when taking the partial derivative with respect to $U$. I assume the approach for taking the partial derivative with respect to $V$ would be the same.

2 Answers

Accepted answer:

The tricky part here is computing $$ \frac{\partial}{\partial U}\sum_{(i,j)\in{\Omega}}(M_{ij}-u_iv_{j}^T)^2. $$ One approach is to go entry by entry. Note that the $p,q$ entry of this partial derivative is equal to the partial derivative of the sum with respect to the $p,q$ entry of $U$. We note that $$ \begin{align} \frac{\partial}{\partial U_{pq}} \sum_{(i,j)\in{\Omega}}(M_{ij}-u_iv_{j}^T)^2 &= \sum_{(i,j)\in{\Omega}} \frac{\partial}{\partial U_{pq}}(M_{ij}-u_iv_{j}^T)^2 \\ & = \sum_{j \in S_p} \frac{\partial}{\partial U_{pq}}(M_{pj}-u_pv_{j}^T)^2 \\ & = \sum_{j \in S_p} 2(M_{pj}-u_pv_{j}^T)\frac{\partial}{\partial U_{pq}}(M_{pj}-u_pv_{j}^T) \\ & = \sum_{j \in S_p} 2(M_{pj}-u_pv_{j}^T)(-V_{jq}) = -2\sum_{j \in S_p} (M_{pj}-u_pv_{j}^T)V_{jq}. \end{align} $$ Here, $S_p = \{j: (p,j) \in \Omega\}$.
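Stacking the entrywise result above into a matrix: if $P$ denotes the 0/1 mask of $\Omega$, the gradient of the sum with respect to $U$ is $-2\,(P\odot(M-UV^T))V$. A minimal NumPy sketch (the sizes, mask density, and variable names are illustrative assumptions, not from the question) checks one entry of this formula against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 4, 5, 2
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))
P = rng.random((m, n)) < 0.6            # 0/1 mask of the observed set Omega

def f(U):
    """Sum over Omega of (M_ij - u_i v_j^T)^2 (no 1/2 factor, no penalty)."""
    R = M - U @ V.T
    return np.sum(R[P] ** 2)

# Entrywise formula assembled into one matrix:
# G[p, q] = -2 * sum_{j in S_p} (M_pj - u_p v_j^T) V_jq
G = -2 * (P * (M - U @ V.T)) @ V

# Central finite-difference check of a single entry (p, q)
eps, p, q = 1e-6, 2, 1
E = np.zeros_like(U); E[p, q] = eps
fd = (f(U + E) - f(U - E)) / (2 * eps)
print(abs(fd - G[p, q]) < 1e-4)         # agreement up to truncation error
```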


If the known entries form a submatrix, then let $J_1$ denote the matrix whose rows are the rows of the identity matrix corresponding to the selected rows of $M$, and let $J_2$ denote the matrix whose columns are the columns of the identity matrix corresponding to the selected columns. We can then write the sum as $$ \sum_{(i,j)\in{\Omega}}(M_{ij}-u_iv_{j}^T)^2 = \|J_1(M - UV^T)J_2\|_F^2. $$ Writing $R = J_1(M - UV^T)J_2$ for the restricted residual, we compute $$ \begin{align} \|J_1(M - &[U+H]V^T)J_2\|_F^2 = \operatorname{tr}([R - J_1HV^TJ_2][R - J_1HV^TJ_2]^T) \\ & = \|R\|_F^2 -2 \operatorname{tr}([J_1HV^TJ_2]R^T) + o(\|H\|) \\ & = \|R\|_F^2 -2 \operatorname{tr}([V^TJ_2]R^TJ_1H) + o(\|H\|) \\ & = \|R\|_F^2 + \operatorname{tr}([-2J_1^TRJ_2^TV]^TH) + o(\|H\|). \end{align} $$ Correspondingly, we compute $$ \frac{\partial}{\partial U}\|J_1(M - UV^T)J_2\|_F^2 = -2 J_1^TRJ_2^TV = -2 J_1^TJ_1(M - UV^T)J_2J_2^TV. $$ To make some sense of this: $J_1(M - UV^T)J_2$ is the known submatrix of the residual $M - UV^T$, and $J_2^TV$ is the submatrix of $V$ obtained by taking only the rows of $V$ corresponding to known columns of $M$.
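When $\Omega$ really is a row/column submatrix, the selectors $J_1, J_2$ are easy to build, and the gradient $-2J_1^TJ_1(M-UV^T)J_2J_2^TV$ of $\|J_1(M-UV^T)J_2\|_F^2$ can be checked numerically. In this sketch the dimensions and index sets are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 5, 6, 2
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))

rows, cols = [0, 2, 3], [1, 4]          # Omega = rows x cols, i.e. a submatrix
J1 = np.eye(m)[rows, :]                 # stacks the selected rows of I_m
J2 = np.eye(n)[:, cols]                 # stacks the selected columns of I_n

def f(U):
    """||J1 (M - U V^T) J2||_F^2, the restricted residual norm."""
    return np.sum((J1 @ (M - U @ V.T) @ J2) ** 2)

# Proposed closed-form gradient with respect to U
G = -2 * J1.T @ J1 @ (M - U @ V.T) @ J2 @ J2.T @ V

# Full central finite-difference gradient for comparison
eps = 1e-6
num = np.zeros_like(U)
for p in range(m):
    for q in range(r):
        E = np.zeros_like(U); E[p, q] = eps
        num[p, q] = (f(U + E) - f(U - E)) / (2 * eps)
print(np.allclose(G, num, atol=1e-4))
```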

Second answer:

$\def\p#1#2{\frac{\partial #1}{\partial #2}}$ Capture the index constraint using the Hadamard product $(\odot)$ and a projection matrix $$\eqalign{ P_{ik} &= \begin{cases} 1\quad{\rm if\,}(i,k)\in\Omega \\ 0\quad{\rm otherwise} \end{cases} \\ P &= P\odot P \qquad\; ({\rm projective\,property}) \\ \sum_{(i,k)\in\Omega}\!\!M_{ik} &= \sum_{i,k}\big(P\odot M\big)_{ik} \qquad ({\rm the\,mask\,selects\,}\Omega) \\ }$$

Denote the standard vector basis by $\{e_k\}$ and the standard matrix basis by $\big\{E_{ik} \doteq e_ie_k^T\big\}$, and write the trace/Frobenius product with a colon, i.e. $$A:B \;=\; \sum_{i=1}^m\sum_{k=1}^nA_{ik}B_{ik} \;=\; {\rm Tr}(A^TB)$$

For typing convenience, define the indexed matrix $$A_{ik} \;=\; P\odot\Big(Ue_i(Ve_k)^T-M\Big) \;=\; P\odot\Big(UE_{ik}V^T-M\Big)$$

Assembling all of this with the Einstein summation convention allows us to write the objective function and calculate its differential. $$\eqalign{ F &= \tfrac 12A_{ik}:A_{ik} + \tfrac \lambda 2\big(U:U + V:V\big) \\ dF &= A_{ik}:dA_{ik} + \lambda\big(U:dU + V:dV\big) \\ &= A_{ik}:\big(dU\,E_{ik}V^T+UE_{ik}\,dV^T\big) +\lambda\big(U:dU+V:dV\big) \\ &= (VE_{ik}^TA_{ik}+\lambda U):dU + (A_{ik}^TUE_{ik}+\lambda V):dV \\ }$$

Holding $V$ constant (i.e. setting $\,dV=0$) yields the gradient with respect to $U$, while holding $U$ constant yields the gradient with respect to $V$: $$\eqalign{ \p{F}{U} &= VE_{ik}^TA_{ik}+\lambda U \\ \p{F}{V} &= A_{ik}^TUE_{ik}+\lambda V \\ }$$
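In plain mask form, without the basis-matrix indexing, the same objective is $F = \tfrac12\|P\odot(UV^T-M)\|_F^2 + \tfrac\lambda2(\|U\|_F^2+\|V\|_F^2)$ and the gradients reduce to $(P\odot(UV^T-M))V+\lambda U$ and $(P\odot(UV^T-M))^TU+\lambda V$. A short sketch (the sizes and $\lambda$ value are my own choices) verifies both against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r, lam = 4, 5, 2, 0.3
M = rng.standard_normal((m, n))
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))
P = (rng.random((m, n)) < 0.5).astype(float)  # projection mask for Omega

def F(U, V):
    """Full regularized objective in mask form."""
    A = P * (U @ V.T - M)
    return 0.5 * np.sum(A**2) + 0.5 * lam * (np.sum(U**2) + np.sum(V**2))

A = P * (U @ V.T - M)
gU = A @ V + lam * U        # dF/dU = (P o (U V^T - M)) V + lam U
gV = A.T @ U + lam * V      # dF/dV = (P o (U V^T - M))^T U + lam V

eps = 1e-6
numU = np.zeros_like(U)     # finite-difference gradient with respect to U
for p in range(m):
    for q in range(r):
        E = np.zeros_like(U); E[p, q] = eps
        numU[p, q] = (F(U + E, V) - F(U - E, V)) / (2 * eps)
print(np.allclose(gU, numU, atol=1e-4))

E2 = np.zeros_like(V); E2[1, 0] = eps   # spot-check one entry of dF/dV
fdV = (F(U, V + E2) - F(U, V - E2)) / (2 * eps)
print(abs(fdV - gV[1, 0]) < 1e-4)
```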


A quick note about the Frobenius and Hadamard products. They are mutually commutative, i.e. $$\eqalign{ A:B &= B:A \\ A\odot B &= B\odot A \\ C:A\odot B &= C\odot A:B \\ }$$ In addition, the properties of the trace allow the terms in a Frobenius product to be rearranged in a number of equivalent ways, e.g. $$\eqalign{ A:B &= A^T:B^T \\ CA:B &= A:C^TB = C:BA^T \\ }$$ The thing to keep in mind in these manipulations is that the matrices on either side of a Frobenius (or Hadamard) product must have exactly the same dimensions.
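These identities are easy to spot-check numerically. A quick sketch over arbitrary random matrices (shapes chosen only so the products are defined) confirms each rule:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((3, 4))   # same shape as A, B: used in C : A o B
S = rng.standard_normal((3, 3))   # square left factor: used in S A : B

def frob(X, Y):
    """Frobenius product X:Y = Tr(X^T Y) = sum_ij X_ij Y_ij."""
    return np.trace(X.T @ Y)

print(np.isclose(frob(A, B), frob(B, A)))             # A:B = B:A
print(np.allclose(A * B, B * A))                      # A o B = B o A
print(np.isclose(frob(C, A * B), frob(C * A, B)))     # C : A o B = C o A : B
print(np.isclose(frob(A, B), frob(A.T, B.T)))         # A:B = A^T : B^T
print(np.isclose(frob(S @ A, B), frob(A, S.T @ B)))   # SA:B = A : S^T B
print(np.isclose(frob(S @ A, B), frob(S, B @ A.T)))   # SA:B = S : B A^T
```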