Minimizing the Frobenius norm of a matrix involving the Hadamard product, $\|X(A\odot Y)-S\|_F$


Let $S\in\mathbb{R}^{L\times N}$ and $A\in\mathbb{R}^{M\times N}$ be known and arbitrary. I'd like to solve the following system: \begin{align} \min_{X\in\mathbb{R}^{L\times M},Y\in\mathbb{R}^{M\times N}} \frac{1}{2}\|X(A\odot Y)-S\|_F^2, \end{align} where $\odot$ is the Hadamard product.

I've proceeded by defining $Z=X(A\odot Y)-S$ and then rewriting the objective function by $f=\frac{1}{2}Z:Z$, where ($:$) is the Frobenius product. Differentiating, we have \begin{align} \text{d}f&=Z:dZ\\ &=Z:\text{d}[X(A\odot Y)-S]\\ &=Z:[X\text{d}(A\odot Y)+\text{d}X(A\odot Y)-\underbrace{\text{d}S}_{=0}]\\ &=Z:[X(A\odot \text{d}Y+\underbrace{\text{d}A}_{=0}\odot Y)+\text{d}X(A\odot Y)]\\ &=Z:[X(A\odot \text{d}Y)+\text{d}X(A\odot Y)]\\ &=[X(A\odot Y)-S]:[X(A\odot \text{d}Y)+\text{d}X(A\odot Y)]\\ &=[X(A\odot Y)-S]:[X(A\odot \text{d}Y)]+[X(A\odot Y)-S]:[\text{d}X(A\odot Y)]\\ &=[X(A\odot Y)]:[X(A\odot \text{d}Y)]+[X(A\odot Y)]:[\text{d}X(A\odot Y)]\\ &\qquad-S:[X(A\odot \text{d}Y)]-S:[\text{d}X(A\odot Y)] \end{align}
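As a sanity check on the last line above, here is a small numerical comparison of the predicted differential against a finite difference (NumPy; the dimensions, random test matrices, and step size are arbitrary choices):

```python
# Numerical check of df = Z : [X(A ⊙ dY) + dX (A ⊙ Y)] for random data.
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 5, 3, 4
A, S = rng.standard_normal((M, N)), rng.standard_normal((L, N))
X, Y = rng.standard_normal((L, M)), rng.standard_normal((M, N))
dX, dY = rng.standard_normal((L, M)), rng.standard_normal((M, N))

frob = lambda P, Q: np.sum(P * Q)                       # Frobenius product (:)
f = lambda X, Y: 0.5 * np.linalg.norm(X @ (A * Y) - S, "fro") ** 2

Z = X @ (A * Y) - S
df = frob(Z, X @ (A * dY) + dX @ (A * Y))               # predicted first-order change

eps = 1e-6
fd = (f(X + eps * dX, Y + eps * dY) - f(X, Y)) / eps    # finite-difference estimate
print(df, fd)                                           # agree up to O(eps) terms
```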

This is as far as I have been able to get. I know that I need to find the gradient of $f$ with respect to $X$ and $Y$ and then try to find a pair $(X^*,Y^*)$ that satisfies the two first-order conditions. I'm also aware that such a pair may not be unique (e.g., for any $c\in\mathbb{R}\setminus\{0\}$, the pair $(cX^*,c^{-1}Y^*)$ is also a solution to this system), but I'm willing to overlook this for now, as I want to first characterize what a solution set would look like. I've done a search, but haven't been able to figure out whether operations of the form $X(A\odot Y)$ simplify nicely, so I'm unsure how to proceed from here.


There are 2 answers below.


@A.P., your two equalities are correct. The first one, $(X(A\circ Y)-S)(A^T\circ Y^T)=0$, is clear, but the second is less so. Indeed, the rule for rearranging a Hadamard-Frobenius combination is not at all obvious. We can reason as follows:

For every $K\in M_{M,N}$, the first-order condition in $Y$ requires $\dfrac{\partial f}{\partial Y}:K\mapsto \operatorname{tr}\big((X(A\circ Y)-S)(A^T\circ K^T)X^T\big)$ to vanish, that is, $\operatorname{tr}\big(X^T(X(A\circ Y)-S)(A^T\circ K^T)\big)=0$.

Let $U=\{(i,j)\mid a_{ij}=0\}$; then $(A^T\circ K^T)_{ji}=0$ when $(i,j)\in U$ and, as $K$ varies, the entries at indices $(j,i)$ such that $(i,j)\notin U$ are free. It is not difficult to see that the trace condition forces $(X^T(X(A\circ Y)-S))_{k,l}=0$ when $(k,l)\notin U$, while the entries with $(k,l)\in U$ are unconstrained. That is equivalent to $A\circ (X^T(X(A\circ Y)-S))=0$.
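To make the rearrangement rule explicit, write $G=X^T(X(A\circ Y)-S)$ (the symbol $G$ is introduced here only for brevity). Then for any $K$,
$$\operatorname{tr}\big(G\,(A^T\circ K^T)\big)=\sum_{i,j}G_{ij}\,(A^T\circ K^T)_{ji}=\sum_{i,j}G_{ij}\,a_{ij}\,k_{ij}=(A\circ G):K,$$
so requiring this to vanish for every $K$ is exactly the condition $A\circ G=0$ obtained above.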

Note that if $A$ has no zero entries, then finding $Y$ is equivalent to finding $A\circ Y$; so put $T=A\circ Y$ and the problem reduces to finding $\inf_{X,T}\|XT-S\|^2$. This is equivalent to seeking the best approximation of $S$ by a matrix of rank at most $\min(L,M,N)$ (Eckart-Young); the standard method is to compute the SVD of $S$. Thus, I am not sure that a gradient method is the best one here.
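For instance, a minimal NumPy sketch of this SVD route (the particular split of the rank-$r$ factor into $X$ and $T$, the zero-padding, and the random test data are arbitrary choices of the sketch):

```python
# Sketch: when A has no zero entries, minimize ||X T - S||_F over products of
# rank at most r = min(L, M, N) via the truncated SVD of S, then set Y = T / A.
import numpy as np

rng = np.random.default_rng(1)
L, M, N = 6, 3, 5
A = rng.standard_normal((M, N)) + 5.0      # shifted so that A has no zero entries
S = rng.standard_normal((L, N))

r = min(L, M, N)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
X = U[:, :r]                               # L x r
T = np.diag(s[:r]) @ Vt[:r, :]             # r x N, so X T is the best rank-r approximation

# Pad with zeros to the required L x M and M x N shapes when r < M.
X = np.hstack([X, np.zeros((L, M - r))])
T = np.vstack([T, np.zeros((M - r, N))])

Y = T / A                                  # element-wise; valid since A has no zeros
print(np.linalg.norm(X @ (A * Y) - S, "fro"))   # optimal residual (Eckart-Young)
```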

When $A$ has zero entries, the problem instead reduces to finding $\inf_{X,T}\|XT-S\|^2$ under the condition that $t_{ij}=0$ when $(i,j)\in U$.


$ \def\LR#1{\left(#1\right)} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\bp{{\boldsymbol +}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} $In addition to $Z,\,$ it will prove convenient to define the matrix $W$ as $$\eqalign{ W &= A\odot Y \qiq Z=XW-S \\ }$$ Set the two gradients (that you calculated in the comments) to zero, and use pseudoinverses to solve them independently $$\eqalign{ \grad{f}{X} &= ZW^T=\LR{XWW^T-SW^T}\doteq 0&\qiq X=SW^\bp \\ \grad{f}{Y} &= A\odot(X^TZ)=A\odot\LR{X^TXW-X^TS}\doteq 0&\qiq W=X^\bp S \\ }$$ Alternating Least Squares (ALS) can be used to solve for the optimal $(X,W)$ matrices $$\eqalign{ X_{k+1} = SW_k^\bp ,\qquad W_{k+1} = X_{k+1}^\bp S ,\qquad k = {k+1} \\ }$$ Using random $(A,S)$ matrices of various $(L,M,N)$ dimensions and random initializations for $(X_0,Y_0)\;$ I observed the following behavior:


$\qquad$If $\,M\ge\min(L,N)\,$ then the iteration converges in a single step.
$\qquad$In this case, $f\approx 0$ and both gradients $\approx 0$

$\qquad$If $\,M<\min(L,N)\,$ then the iteration converges after a few dozen steps.
$\qquad$In this case, $f>0$ but both gradients $\approx 0$
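
For concreteness, here is a minimal NumPy sketch of the ALS iteration above; the dimensions, random initialization, iteration cap, and stopping tolerance are arbitrary illustrative choices:

```python
# Alternating Least Squares sketch: X <- S W^+, W <- X^+ S, monitoring f and
# the X-gradient Z W^T.  (The Y-gradient is A ⊙ (X^T Z) with W = A ⊙ Y.)
import numpy as np

rng = np.random.default_rng(2)
L, M, N = 8, 3, 6                          # here M < min(L, N), so f stays > 0
A = rng.standard_normal((M, N))
S = rng.standard_normal((L, N))

Y = rng.standard_normal((M, N))            # random Y_0
W = A * Y                                  # W_0 = A ⊙ Y_0

for k in range(500):
    X = S @ np.linalg.pinv(W)              # X_{k+1} = S W_k^+
    W = np.linalg.pinv(X) @ S              # W_{k+1} = X_{k+1}^+ S
    Z = X @ W - S
    f = 0.5 * np.linalg.norm(Z, "fro") ** 2
    if np.linalg.norm(Z @ W.T) < 1e-8:     # X-gradient ~ 0 at a stationary point
        break

print(k, f)
```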


As with Non-negative Matrix Factorization (NMF), the factors $(X,W)$ are not unique. Each initialization $(X_0,W_0)$ can lead to a different local minimum, and finding the global minimum is NP-hard.


This similarity to NMF suggests another idea, i.e. Lee-Seung iteration $$\eqalign{ X_+ = X\odot\fracLR{SW^T}{XWW^T} ,\qquad W_+ = W\odot\fracLR{X^TS}{X^TXW} }$$ This iteration uses element-wise division and avoids calculating the pseudoinverse at each step. Each step is inexpensive, but it converges very slowly, often requiring thousands of iterations.
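A minimal sketch of these multiplicative updates; to keep the element-wise divisions well defined it assumes nonnegative $S$ and strictly positive initial factors, as in standard NMF (that restriction, and the small $\epsilon$ guard, are choices of the sketch, not part of the updates above):

```python
# Lee-Seung style multiplicative updates for min ||X W - S||_F^2.
# Nonnegative S and positive X_0, W_0 are assumptions of this sketch.
import numpy as np

rng = np.random.default_rng(3)
L, M, N = 8, 3, 6
S = rng.random((L, N))                     # nonnegative test data
X = rng.random((L, M)) + 0.1               # strictly positive initial factors
W = rng.random((M, N)) + 0.1
eps = 1e-12                                # guards against division by zero

for _ in range(5000):                      # cheap steps, but many of them
    X *= (S @ W.T) / (X @ W @ W.T + eps)   # X_+ = X ⊙ (S W^T) / (X W W^T)
    W *= (X.T @ S) / (X.T @ X @ W + eps)   # W_+ = W ⊙ (X^T S) / (X^T X W)

print(0.5 * np.linalg.norm(X @ W - S, "fro") ** 2)
```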

However, the Lee-Seung iteration does confer control over the structure of the $(X,W)$ factors: since the updates are multiplicative, any zero element of $X_0$ produces a corresponding zero element in the final $X$ factor. This can be used to generate upper/lower triangular factors.

Element-wise positivity (or negativity) of the factors can be controlled in a similar fashion. Likewise column (or row) normalization constraints can be imposed on one of the factors at each iteration.
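For example, a short sketch combining both remarks: a lower-triangular $X_0$ whose zero pattern survives the multiplicative updates, plus a row normalization of $W$ with the scale absorbed into $X$ so the product $XW$ is unchanged (nonnegative test data and the particular normalization are again choices of the sketch):

```python
# Structure control with multiplicative updates: zeros of X_0 stay zero,
# and W is re-normalized each step with the scale pushed into X.
import numpy as np

rng = np.random.default_rng(4)
L, M, N = 6, 3, 6
S = rng.random((L, N))
X = np.tril(rng.random((L, M)) + 0.1)      # lower-triangular X_0
W = rng.random((M, N)) + 0.1
eps = 1e-12

for _ in range(2000):
    X *= (S @ W.T) / (X @ W @ W.T + eps)   # zero entries of X remain zero
    W *= (X.T @ S) / (X.T @ X @ W + eps)
    d = np.linalg.norm(W, axis=1, keepdims=True)
    W /= d                                 # unit-norm rows of W ...
    X *= d.T                               # ... with the scale absorbed into X

print(np.allclose(X, np.tril(X)),          # True: the triangular pattern survived
      0.5 * np.linalg.norm(X @ W - S, "fro") ** 2)
```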