Minimizing $||A - (M^2) (N^2)||_F^2$ vs NMF


A reasonable way to factor a non-negative matrix $A$ into two non-negative matrices is to minimize the squared Frobenius norm $||A - (M^2)(N^2)||_F^2$ over unconstrained matrices $M$ and $N$, where $M^2$ denotes squaring each entry of $M$ element-wise (likewise for $N^2$). The resulting factorization is $A \approx WH$, where $W = M^2$ and $H = N^2$, with squaring done element-wise as before.
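For concreteness, here is a small NumPy sketch of the idea (my own illustration; the dimensions and names are arbitrary): the element-wise squares $M^2$ and $N^2$ are non-negative by construction, so no explicit constraint is needed.

```python
import numpy as np

# Toy sketch of the squared-factor objective: approximate a non-negative
# matrix A by (M**2) @ (N**2), where ** squares entries element-wise.
rng = np.random.default_rng(0)
A = rng.random((6, 4))            # non-negative target matrix
r = 2                             # inner rank of the factorization
M = rng.standard_normal((6, r))   # unconstrained factors (may be negative)
N = rng.standard_normal((r, 4))

W, H = M**2, N**2                 # element-wise squares are non-negative
loss = np.linalg.norm(A - W @ H, 'fro')**2
print(W.min() >= 0 and H.min() >= 0)   # non-negativity holds automatically
```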

I tried to search for information about this method, but I could not find anything. I am specifically interested in how this method relates to non-negative matrix factorization (NMF), and the pros and cons of factorizing a matrix by the method described above over using NMF.


To be concrete, let's specify the matrix dimensions and the objective function
$$\eqalign{ A&\in{\mathbb R}^{m\times n},\quad W\in{\mathbb R}^{m\times r},\quad H\in{\mathbb R}^{n\times r} \cr B &= WH^T-A \cr \phi &= \tfrac{1}{2}\|B\|_F^2 = \tfrac{1}{2}B:B \cr }$$
where a colon is used to denote the matrix inner product. Note that this formulation of the problem has implicit rank constraints baked into it, i.e.
$$\eqalign{ {\rm rank}(W) &\leq \min(m,r) \cr {\rm rank}(H) &\leq \min(n,r) \cr }$$
It will prove useful to combine the unknown matrices into a single matrix variable
$$\eqalign{ X &= \pmatrix{W\\H} \in {\mathbb R}^{k\times r},\quad k = m+n\cr }$$
and it will be useful to have a way to extract the constituent matrices from $X$
$$\eqalign{ E_w&=\pmatrix{I_{mm}\\0_{nm}},\quad &E_h=\pmatrix{0_{mn}\\I_{nn}},\quad &I_{kk} = E_wE_w^T+E_hE_h^T \cr W &= E_w^TX,\quad &H = E_h^TX,\quad &B = E_w^TXX^TE_h - A \cr &&&C = E_wBE_h^T \cr }$$
Next, calculate the differential and gradient of the objective function.
$$\eqalign{ \phi &= \tfrac{1}{2}B:B \cr d\phi &= B:dB \cr &= B:E_w^T\,d(XX^T)\,E_h \cr &= E_wBE_h^T:d(XX^T) \cr &= C:d(XX^T) \cr &= (CX+C^TX):dX \cr G = \frac{\partial\phi}{\partial X} &= CX + C^TX \cr }$$
Yielding $G$ as the gradient in the absence of non-negativity constraints.
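The gradient formula $G = CX + C^TX$ can be sanity-checked numerically. The following sketch (my own check, not part of the answer above) compares it against central finite differences on a small random instance:

```python
import numpy as np

# Verify the unconstrained gradient G = C X + C^T X against central
# finite differences of phi(X) = 0.5 * ||E_w^T X X^T E_h - A||_F^2.
rng = np.random.default_rng(1)
m, n, r = 5, 4, 2
k = m + n
A = rng.random((m, n))
X = rng.standard_normal((k, r))
Ew = np.vstack([np.eye(m), np.zeros((n, m))])   # extracts W: W = Ew.T @ X
Eh = np.vstack([np.zeros((m, n)), np.eye(n)])   # extracts H: H = Eh.T @ X

def phi(X):
    B = Ew.T @ X @ X.T @ Eh - A                 # B = W H^T - A
    return 0.5 * np.sum(B * B)

B = Ew.T @ X @ X.T @ Eh - A
C = Ew @ B @ Eh.T
G = C @ X + C.T @ X                             # analytic gradient

eps = 1e-6
G_fd = np.zeros_like(X)
for i in range(k):
    for j in range(r):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        G_fd[i, j] = (phi(Xp) - phi(Xm)) / (2 * eps)

print(np.max(np.abs(G - G_fd)))   # should be tiny (finite-difference error)
```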

Let's use a lowercase letter to denote an unconstrained matrix, and an uppercase letter for the corresponding non-negative matrix. Then, as you suggested, let's use the elementwise product (denoted by $\odot$) to enforce the constraint.
$$\eqalign{ W &= w\odot w, \quad H = h\odot h, \quad X = x\odot x,\quad x = \pmatrix{w\\h} \cr d\phi &= G:dX \cr &= G:(2x\odot dx) \cr &= (2G\odot x):dx \cr g = \frac{\partial\phi}{\partial x} &= 2G\odot x \cr }$$
So $g$ is the gradient in the presence of the non-negativity constraints.
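The chain rule $g = 2G\odot x$ can be checked the same way. This sketch (assumed helper code, not from the answer) differentiates through the squared parameterization $X = x\odot x$:

```python
import numpy as np

# Check the constrained gradient g = 2 * G * x, where X = x * x
# (element-wise) enforces non-negativity of X.
rng = np.random.default_rng(2)
m, n, r = 5, 4, 2
k = m + n
A = rng.random((m, n))
x = rng.standard_normal((k, r))
Ew = np.vstack([np.eye(m), np.zeros((n, m))])
Eh = np.vstack([np.zeros((m, n)), np.eye(n)])

def phi(x):
    X = x * x                          # squared parameterization
    B = Ew.T @ X @ X.T @ Eh - A
    return 0.5 * np.sum(B * B)

X = x * x
B = Ew.T @ X @ X.T @ Eh - A
C = Ew @ B @ Eh.T
G = C @ X + C.T @ X                    # gradient w.r.t. X
g = 2 * G * x                          # chain rule: gradient w.r.t. x

eps = 1e-6
g_fd = np.zeros_like(x)
for i in range(k):
    for j in range(r):
        xp, xm = x.copy(), x.copy()
        xp[i, j] += eps
        xm[i, j] -= eps
        g_fd[i, j] = (phi(xp) - phi(xm)) / (2 * eps)

print(np.max(np.abs(g - g_fd)))        # finite-difference error, near zero
```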

Note that the problem has now been re-formulated as an unconstrained minimization in terms of a single variable, so you can use your favorite method (CG, L-BFGS, etc) to solve for $x$. Then you can recover $X$ from $x$, and finally the individual $(W,H)$ matrices can be recovered from $X$.
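The workflow above can be sketched end-to-end with SciPy's L-BFGS-B (my choice of solver; any gradient-based method would do). The block-wise form $G = \pmatrix{BH\\B^TW}$ used below follows from expanding $CX + C^TX$ with $C = E_wBE_h^T$:

```python
import numpy as np
from scipy.optimize import minimize

# Minimize phi over the flat vector x, then recover X = x * x and
# split it into the non-negative factors (W, H).
rng = np.random.default_rng(3)
m, n, r = 8, 6, 2
A = rng.random((m, 2)) @ rng.random((2, n))   # a rank-2 non-negative target
k = m + n

def fun(xflat):
    x = xflat.reshape(k, r)
    X = x * x
    W, H = X[:m], X[m:]
    B = W @ H.T - A
    G = np.vstack([B @ H, B.T @ W])   # block form of C X + C^T X
    g = 2 * G * x                     # chain rule for X = x * x
    return 0.5 * np.sum(B * B), g.ravel()

x0 = rng.standard_normal(k * r)
res = minimize(fun, x0, jac=True, method='L-BFGS-B',
               options={'maxiter': 500})
X = res.x.reshape(k, r) ** 2
W, H = X[:m], X[m:]
print(np.linalg.norm(W @ H.T - A, 'fro'))     # small residual on success
```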

It's probably not a convex problem, so you should try several random starting points for $x$ and let the algorithm run for a few hundred iterations from each point.

The numerical convergence can also be improved by ignoring the missing elements of $A$ and by introducing some kind of regularization into the problem. This will slightly alter some of the above formulas, e.g.
$$\eqalign{ B &= P\odot(WH^T-A) \cr \phi &= \tfrac{1}{2}\big(B:B + \lambda X:X\big) \cr G &= CX + C^TX + \lambda X \cr }$$
where $\lambda$ is a small constant and $P$ is a mask matrix with elements equal to $0$ when the corresponding element of $A$ is missing and equal to $1$ otherwise.
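The masked, regularized objective and gradient can be written in a few lines (my own illustration of the modified formulas; the mask density and $\lambda$ are arbitrary). Since $P$ is binary, $P\odot B = B$, so the gradient blocks keep the same shape as before:

```python
import numpy as np

# Masked + ridge-regularized objective:
#   B   = P * (W H^T - A)         (masked residual)
#   phi = 0.5 * (B:B + lam * X:X)
#   G   = [B H; B^T W] + lam * X  (block form of C X + C^T X + lam X)
rng = np.random.default_rng(4)
m, n, r, lam = 5, 4, 2, 1e-3
k = m + n
A = rng.random((m, n))
P = (rng.random((m, n)) > 0.2).astype(float)  # 1 = observed, 0 = missing
X = rng.standard_normal((k, r))
W, H = X[:m], X[m:]

B = P * (W @ H.T - A)
phi = 0.5 * (np.sum(B * B) + lam * np.sum(X * X))
G = np.vstack([B @ H, B.T @ W]) + lam * X
print(phi, G.shape)
```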