Hessian for Matrix Factorization objective


Consider the matrix-factorization objective $$f(W,H) = \frac 12 \|X - WH\|_F^2 \to \min,$$ where $X \in \mathbb R^{n \times m}$ is given and $W \in \mathbb R^{n \times k}$, $H \in \mathbb R^{k \times m}$ are to be found.

Two questions:

  1. Is there a simple representation of $\nabla^2 f$? It should be a matrix of size $(nk + km) \times (nk + km)$, which stores $\frac {\partial^2 f} {\partial W_{ij} \partial W_{kl}}$, $\frac {\partial^2 f} {\partial W_{ij} \partial H_{kl}}$, $\frac {\partial^2 f} {\partial H_{ij} \partial H_{kl}}$ for all $i,j,k,l$.

For the gradients there are simple formulae: \begin{align*} \nabla f_W &= -X H^\top + W H H^\top \\ \nabla f_H &= -W^\top X + W^\top W H, \end{align*} which are not hard to obtain by expanding the Frobenius norm via the inner product. For the Hessian I don't know of one: I tried to derive it, and it's doable, but the intermediate results are a complete mess.
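The gradient formulae above can be sanity-checked numerically. A minimal sketch with NumPy (the sizes and random data are purely illustrative), comparing $\nabla f_W$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 5, 4, 3
X = rng.random((n, m))
W = rng.random((n, k))
H = rng.random((k, m))

def f(Wm, Hm):
    return 0.5 * np.linalg.norm(X - Wm @ Hm, "fro") ** 2

# Closed-form gradients from the formulae above.
grad_W = -X @ H.T + W @ H @ H.T
grad_H = -W.T @ X + W.T @ W @ H

# Independent check of grad_W via central finite differences.
eps = 1e-6
fd_W = np.zeros_like(W)
for i in range(n):
    for j in range(k):
        Wp, Wm_ = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm_[i, j] -= eps
        fd_W[i, j] = (f(Wp, H) - f(Wm_, H)) / (2 * eps)

print(np.allclose(grad_W, fd_W, atol=1e-5))  # True
```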

  2. (What I really need) Is there a simple way to compute the smallest eigenvalue of $\nabla^2 f$? Ideally I don't want to build $\nabla^2 f$ at all, since it's rather large, and I'm not sure I could find the smallest eigenvalue quickly. At the very least, is there a way to work with a smaller matrix instead? It is also guaranteed that all elements of $X,W,H$ are nonnegative.
Two answers follow.

From your formulas, we can see that $$ \frac{\partial f}{\partial W_{ij}} = -X^i (H^j)^T + W^i H (H^j)^T,\\ \frac{\partial f}{\partial H_{ij}} = -W_i^TX_j + W_i^T W H_j, $$ where $M^i$ denotes the $i$th row of $M$ and $M_i$ its $i$th column. Thus, we have $$ \frac{\partial^2 f}{\partial W_{kl}\partial W_{ij}} = \delta_{ik} H^l (H^j)^T = \delta_{ik}\,[HH^T]_{lj}. $$ The mixed partial is a bit trickier. We find that $$ \frac{\partial^2 f}{\partial H_{kl}\partial W_{ij}} = -\delta_{jk}X_{il} + W_{ik}H_{jl} + \delta_{jk}W^iH_l = \delta_{jk}\,[WH-X]_{il} + W_{ik}H_{jl}. $$ Finally, we have $$ \frac{\partial^2 f}{\partial H_{kl}\partial H_{ij}} = \delta_{jl}\, W_i^T W_k. $$
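The mixed-partial formula is the easiest one to get an index wrong in, so here is a small numerical check (a sketch; the shapes and random data are illustrative), comparing the closed form $\delta_{jk}[WH-X]_{il} + W_{ik}H_{jl}$ against a second-order finite difference:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 4, 3, 2
X = rng.random((n, m))
W = rng.random((n, k))
H = rng.random((k, m))

def f(Wm, Hm):
    return 0.5 * np.linalg.norm(X - Wm @ Hm, "fro") ** 2

# Closed-form mixed partial d^2 f / (dH_{pl} dW_{ij}).
def mixed(i, j, p, l):
    return (j == p) * (W @ H - X)[i, l] + W[i, p] * H[j, l]

eps = 1e-5
ok = True
for i in range(n):
    for j in range(k):
        for p in range(k):
            for l in range(m):
                Wp, Wm_ = W.copy(), W.copy()
                Wp[i, j] += eps
                Wm_[i, j] -= eps
                Hp, Hm_ = H.copy(), H.copy()
                Hp[p, l] += eps
                Hm_[p, l] -= eps
                # Central second difference in W_{ij} and H_{pl}.
                fd = (f(Wp, Hp) - f(Wp, Hm_) - f(Wm_, Hp) + f(Wm_, Hm_)) / (4 * eps**2)
                ok = ok and abs(fd - mixed(i, j, p, l)) < 1e-4
print(ok)  # True
```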


With that, we can build $M = \nabla_W^2 f$: we have $$ M = \sum_{i,k = 1}^n\sum_{j,l = 1}^k (e_j \otimes e_i)(e_l \otimes e_k)^T\,\delta_{ik}\, H^l (H^j)^T \\ = \sum_{i = 1}^n\sum_{j,l = 1}^k (e_j \otimes e_i)(e_l^T \otimes e_i^T)\, e_l^T H H^T e_j \\ = \sum_{i = 1}^n\sum_{j,l = 1}^k \big(e_l^T H H^T e_j\big)\,(e_je_l^T) \otimes (e_ie_i^T) \\ = \sum_{j,l = 1}^k \big([H H^T]_{lj}\,e_je_l^T\big) \otimes I_n \\ = (HH^T) \otimes I_n, $$ where the last step uses the symmetry of $HH^T$. I suspect that similar computations can be made for the remaining Hessian blocks.
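The identity $\nabla_W^2 f = (HH^T) \otimes I_n$ can be verified directly: build the block entry by entry from $\delta_{ik}[HH^T]_{lj}$ (using column-major vec ordering, so $(i,j) \mapsto jn+i$) and compare it to the Kronecker product. A minimal sketch with NumPy on illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 4, 3, 2
H = rng.random((k, m))
HHt = H @ H.T

# Entrywise W-block: d^2 f / (dW_{pl} dW_{ij}) = delta_{ip} [HH^T]_{lj},
# with column-major (Fortran) vec ordering (i, j) -> j*n + i.
M = np.zeros((n * k, n * k))
for i in range(n):
    for j in range(k):
        for p in range(n):
            for l in range(k):
                M[j * n + i, l * n + p] = (i == p) * HHt[l, j]

print(np.allclose(M, np.kron(HHt, np.eye(n))))  # True
```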

A second answer, via matrix differentials:

For ease of typing, define $$A = (WH-X) \quad\implies\quad dA = W\,dH+dW\,H$$ and write the gradients more concisely as $$G_w = AH^T, \qquad G_h = W^TA$$ First, calculate the differentials of the gradients. $$\eqalign{ dG_w &= A\,dH^T + dA\,H^T \\ &= A\,dH^T + W\,dH\,H^T + dW\,HH^T \\ dG_h &= W^T\,dA+dW^T\,A \\ &= W^TW\,dH + W^TdW\,H + dW^T\,A \\ }$$ Next, apply the Kronecker-vec operation to these differential expressions. $$\eqalign{ dg_w &= {\rm vec}(dG_w) \\ &= \Big[(I_k\otimes A)C_h + (H\otimes W)\Big]dh + \Big[HH^T\otimes I_n\Big]dw \\ &= \Big[M_{wh}\Big]dh + \Big[M_{ww}\Big]dw \\ \\ dg_h &= {\rm vec}(dG_h) \\ &= \Big[I_m\otimes W^TW\Big]dh + \Big[(H^T\otimes W^T)+(A^T\otimes I_k)C_w\Big]dw \\ &= \Big[M_{hh}\Big]dh + \Big[M_{hw}\Big]dw \\ }$$ where $(C_h,C_w)$ are the commutation matrices associated with the Kronecker product, and $\big(M_{hh},\,M_{hw},\,M_{wh},\,M_{ww}\big)$ can be identified as the desired Hessian matrices.
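These four blocks can be assembled and checked numerically. A sketch with NumPy (random data, small illustrative sizes; the `commutation` helper is an assumed utility that builds the commutation matrix $K$ with $K\,{\rm vec}(B) = {\rm vec}(B^T)$), compared against finite differences of the concatenated gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 4, 3, 2
X = rng.random((n, m))
W = rng.random((n, k))
H = rng.random((k, m))
A = W @ H - X

def commutation(p, q):
    """K such that K @ vec(B) = vec(B.T) for a p x q matrix B (column-major vec)."""
    K = np.zeros((p * q, p * q))
    for i in range(p):
        for j in range(q):
            K[i * q + j, j * p + i] = 1.0
    return K

C_h, C_w = commutation(k, m), commutation(n, k)
I_n, I_k, I_m = np.eye(n), np.eye(k), np.eye(m)

# The four Hessian blocks from the differential identities above.
M_ww = np.kron(H @ H.T, I_n)
M_wh = np.kron(I_k, A) @ C_h + np.kron(H, W)
M_hh = np.kron(I_m, W.T @ W)
M_hw = np.kron(H.T, W.T) + np.kron(A.T, I_k) @ C_w
M = np.block([[M_ww, M_wh], [M_hw, M_hh]])

# Independent check: finite differences of the concatenated gradient.
def grad(w, h):
    Wm, Hm = w.reshape(n, k, order="F"), h.reshape(k, m, order="F")
    Gw = (Wm @ Hm - X) @ Hm.T
    Gh = Wm.T @ (Wm @ Hm - X)
    return np.concatenate([Gw.ravel(order="F"), Gh.ravel(order="F")])

x0 = np.concatenate([W.ravel(order="F"), H.ravel(order="F")])
eps = 1e-6
M_fd = np.column_stack([
    (grad(*np.split(x0 + eps * e, [n * k])) - grad(*np.split(x0 - eps * e, [n * k]))) / (2 * eps)
    for e in np.eye(n * k + k * m)
])
print(np.allclose(M, M_fd, atol=1e-5))  # True
```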

Finally, you wished to assemble them into a single (block) matrix $$ M = \left[\begin{matrix} M_{ww} &M_{wh} \\ M_{hw} &M_{hh} \\ \end{matrix}\right] $$ It's worth pointing out that $M_{hw}^T=M_{wh}\,$ so the Hessian is symmetric (as it should be).
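For the asker's second question, the differential expressions above already give a matrix-free Hessian-vector product, so the smallest eigenvalue can be found with a Lanczos-type solver without ever forming the $(nk+km)\times(nk+km)$ matrix. A sketch with NumPy/SciPy (random data for illustration; `scipy.sparse.linalg.eigsh` with `which='SA'` requests the smallest algebraic eigenvalue, and convergence to interior/smallest eigenvalues can be slow for large ill-conditioned problems):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(3)
n, m, k = 30, 20, 5
X = rng.random((n, m))
W = rng.random((n, k))
H = rng.random((k, m))
A = W @ H - X

def hvp(v):
    """Hessian-vector product, read off from dG_w and dG_h above."""
    dW = v[: n * k].reshape(n, k, order="F")
    dH = v[n * k :].reshape(k, m, order="F")
    dGw = A @ dH.T + W @ dH @ H.T + dW @ H @ H.T
    dGh = W.T @ W @ dH + W.T @ dW @ H + dW.T @ A
    return np.concatenate([dGw.ravel(order="F"), dGh.ravel(order="F")])

N = n * k + k * m
Hess = LinearOperator((N, N), matvec=hvp)
# 'SA' = smallest algebraic; only matvecs are needed, never the dense matrix.
lam_min = eigsh(Hess, k=1, which="SA", return_eigenvectors=False)[0]
print(lam_min)
```

Note that $f$ is nonconvex, so the smallest eigenvalue is in general negative away from a minimizer.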