How to solve a large-scale matrix least squares problem with Frobenius regularization efficiently?


How to solve the following minimization problem: $$\min_{\mathbf{S}\ge 0} F(\mathbf{S}) = \frac{1}{2}\Vert \mathbf{M} - \mathbf{K_2SK_1^T}\Vert _F^2+\frac{1}{20}\Vert\mathbf{S}\Vert_F^2$$ where $\mathbf{S}\in R^{256 \times 256}$ has nonnegative elements, $\mathbf{M}\in R^{n \times m}$, $\mathbf{K_2} \in R^{n \times 256}$, $\mathbf{K_1} \in R^{m \times 256}$. In most cases $3500\lt m \lt 18000$ and $8 \lt n \lt 128$.

The data of a minimal case can be downloaded here. In this case $m=3788$, $n=16$. The following code helps load the data into the workspace:

MATLAB

load('problem.mat')

Python
import scipy.io
data = scipy.io.loadmat('/home/ubuntu/MATLAB/problem.mat')
K1 = data['K1']
K2 = data['K2']
M = data['M']
S_initial_guess = data['S00']

What I've tried

  1. Vectorizing the problem using $\mathbf{K}=\mathbf{K_1}\otimes\mathbf{K_2}$, so that $\mathrm{vec}(\mathbf{K_2SK_1^T}) = \mathbf{K}\,\mathrm{vec}(\mathbf{S})$. But $\mathbf{K}$ is too large for an ordinary PC, and any optimization strategy using the Hessian matrix would produce even larger matrices.

  2. Solving the matrix-form problem directly, which produces a fourth-order Hessian tensor. Without the Hessian, the algorithm (steepest descent with exact/inexact line search) converges too slowly.

  3. CVXPY - runs out of memory:

    import cvxpy as cp

    n = 256
    X = cp.Variable((n, n))
    constraints = [X >= 0]
    gamma = cp.Parameter(nonneg=True, value=1/20)
    obj = cp.Minimize(0.5 * cp.sum_squares(K2 @ X @ K1.T - M) + gamma * cp.sum_squares(X))
    prob = cp.Problem(obj, constraints)
    prob.solve(verbose=True)
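For scale, the explicit Kronecker matrix from attempt 1 is far beyond desktop memory. A sketch (illustrative, using the sizes of the minimal case and random stand-ins for the real matrices) that estimates the cost and builds a matrix-free operator instead:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

# Sizes from the minimal case: M is n x m, K2 is n x p, K1 is m x p
n, m, p = 16, 3788, 256

# Forming K = kron(K1, K2) explicitly needs (n*m) x (p*p) doubles:
bytes_needed = n * m * p * p * 8
print(f"explicit K would need about {bytes_needed / 1e9:.0f} GB")

# Matrix-free alternative: vec(K2 @ S @ K1.T) = kron(K1, K2) @ vec(S)
# (column-major vec), so K and K^T can be applied without forming them.
rng = np.random.default_rng(0)
K1 = rng.standard_normal((m, p))   # stand-in for the real data
K2 = rng.standard_normal((n, p))

def matvec(s):
    # s = vec(S) in column-major (Fortran) order
    S = s.reshape((p, p), order='F')
    return (K2 @ S @ K1.T).ravel(order='F')

def rmatvec(y):
    # adjoint: vec(K2.T @ Y @ K1)
    Y = y.reshape((n, m), order='F')
    return (K2.T @ Y @ K1).ravel(order='F')

K_op = LinearOperator((n * m, p * p), matvec=matvec, rmatvec=rmatvec)
```

Such an operator can be handed to iterative solvers like `scipy.sparse.linalg.lsmr` (whose `damp` parameter covers the $\frac{1}{20}\Vert\mathbf{S}\Vert_F^2$ term, since the objective is proportional to $\Vert K x - b\Vert^2 + \frac{1}{10}\Vert x\Vert^2$), although the nonnegativity constraint still has to be handled separately.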

How to solve it?

How can I solve this large-scale minimization problem efficiently? Could you give me a code snippet (Python or MATLAB) to solve the attached problem? Are there any off-the-shelf toolboxes I could use?

For Algorithm Testing

I've added a new mat file containing $K_1$, $K_2$, $M$ and a reference solution $X_{true}$ for testing. All matrices in this file are much smaller than in the original problem.

There are 2 answers below.

BEST ANSWER

Here is a simple Julia script. If you translate it to another language, beware of the nested loops. Julia handles these efficiently, but they should be vectorized for MATLAB or Python.

The first time the script is run it will create tab-separated-values (TSV) files for the $X$ and $W$ matrices. On subsequent runs, the script will read the TSV files, execute $k_{max}$ iterations, update the TSV files, and exit.

Thus you can intermittently refine the solution until you run out of patience.

#!/usr/bin/env  julia

#  Sequential Coordinate-wise algorithm for Non-Negative Least-Squares
#  as described on pages 10-11 of
#     http://users.wfu.edu/plemmons/papers/nonneg.pdf
#
#  Convergence is painfully slow, but unlike most other NNLS
#  algorithms the objective function is reduced at each step.
#
#  The algorithm described in the PDF was modified from its
#  original vector form:  |Ax - b|²
#    to the matrix form:  |LXKᵀ - M|²  +  λ|X|²
#
#  and to include the regularization term.

using LinearAlgebra, MAT, DelimitedFiles

function main()
  matfile = "problem.mat"
  Xfile   = "problem.mat.X.tsv"
  Wfile   = "problem.mat.W.tsv"

# read the matrices from the Matlab file
  f = matopen(matfile)
    K = read(f,"K1"); println("K: size = $(size(K)),\t rank = $(rank(K))")
    L = read(f,"K2"); println("L: size = $(size(L)),\t rank = $(rank(L))")
    M = read(f, "M"); println("M: size = $(size(M)),\t rank = $(rank(M))")
  # S = read(f,"S00");println("S: size = $(size(S)),\t rank = $(rank(S))")
  close(f)

  A = L'L
  B = K'K
  C = -L'M*K
  m,n = size(C)
  λ = 1/10     # regularization parameter
  kmax = 100   # maximum iterations


# specify the size of the work arrays
  X = 0*C
  W = 1*C
  H = A[:,1] * B[:,1]'

# resume from latest saved state ... or reset to initial conditions
  try
     X = readdlm(Xfile);  println("X: size = $(size(X)), extrema = $(extrema(X))")
     W = readdlm(Wfile);  println("W: size = $(size(W)), extrema = $(extrema(W))")
     println()
  catch
     @warn "Could not read the saved X,W matrices; re-initializing."
     X = 0*C
     W = 1*C
  end

  fxn = (norm(L*X*K' - M)^2 + λ*norm(X)^2) / 2
  println("at step 0, fxn = $fxn")

  k = 0
  while k < kmax
     for i = 1:m
         for j = 1:n
             mul!(H, A[:,i], B[:,j]')
             H[i,j] += λ
             δ = min( X[i,j], W[i,j]/H[i,j] )
             X[i,j] -= δ
             H .*= δ
             W .-= H
         end
     end
     k += 1
     fx2 = (norm(L*X*K' - M)^2 + λ*norm(X)^2) / 2
     println("after step $k, fxn = $fx2")

     # convergence check
     if fx2 ≈ fxn; break; end
     fxn = fx2
  end

# save the current state for the next run
  writedlm(Xfile, X)
  writedlm(Wfile, W)

# peek at the current solution
  println("\nsummary of current solution")
  println(" vector(X) = $(X[1:4]) ... $(X[end-3:end])")
  println("extrema(X) = $(extrema(X))")
end

# invoke the main function                                           
main()
ANSWER
You can use the projected gradient method, or an accelerated projected gradient method such as FISTA. It's not too hard to implement these yourself.

We could vectorize $S$ but it's more elegant to work directly in the vector space $V$ of $256 \times 256$ matrices with entries in $\mathbb R$. We'll need to know the gradient of your function $F$.

The gradient of the function $h(S) = \frac{1}{20} \| S \|_F^2$ is $$ \nabla h(S) = \frac{1}{10} S. $$

The gradient of the function $g(S) = \frac12 \| M - K_2 S K_1^T \|_F^2$ requires a bit more effort. Let $A$ be the linear transformation defined by $$ A(S) = K_2 S K_1^T. $$ Then $$\nabla g(S) = A^*(A(S) - M) $$ where $A^*$ is the adjoint of $A$. If we can figure out what the adjoint of $A$ is, we'll be done.

The defining property of $A^*$ is $$ \tag{1} \langle A(S), U \rangle = \langle S, A^*(U) \rangle $$ for all $S, U$. But note that, from the definition of the Frobenius inner product, we have \begin{align} \langle A(S), U \rangle &= \text{Tr}((K_2 S K_1^T)^T U) \\ &= \text{Tr}(K_1 S^T K_2^T U) \\ &= \text{Tr}(S^T K_2^T U K_1 ) \qquad (\text{because Tr}(XY) = \text{Tr}(YX) )\\ &= \langle S, K_2^T U K_1 \rangle \end{align} Comparing this with (1), we see that $$ A^*(U) = K_2^T U K_1. $$
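The adjoint identity can be checked numerically with random matrices; a quick sanity test at small illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 5, 7, 4                      # small illustrative sizes
K1 = rng.standard_normal((m, p))
K2 = rng.standard_normal((n, p))

A  = lambda S: K2 @ S @ K1.T           # the map A(S) = K2 S K1^T
At = lambda U: K2.T @ U @ K1           # claimed adjoint A*(U) = K2^T U K1

S = rng.standard_normal((p, p))
U = rng.standard_normal((n, m))

lhs = np.sum(A(S) * U)                 # <A(S), U> Frobenius inner product
rhs = np.sum(S * At(U))                # <S, A*(U)>
print(abs(lhs - rhs))                  # zero up to floating-point error
```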

So now we're ready to minimize your function $F$ using the projected gradient iteration, which is $$ S^{k+1} = \max(S^k - t \nabla F(S^k), 0) $$ for $k = 0, 1, \ldots$.
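A minimal numpy sketch of this iteration, using random stand-ins for the data and a constant step $t = 1/L$, where $L = \|K_2\|_2^2\|K_1\|_2^2 + \tfrac{1}{10}$ bounds the Lipschitz constant of $\nabla F$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 16, 40, 8                    # small stand-in sizes
K1 = rng.standard_normal((m, p))
K2 = rng.standard_normal((n, p))
M  = rng.standard_normal((n, m))
lam = 1 / 10                           # (lam/2)||S||^2 matches the (1/20)||S||^2 penalty

def F(S):
    return 0.5 * np.linalg.norm(M - K2 @ S @ K1.T)**2 + 0.5 * lam * np.linalg.norm(S)**2

def gradF(S):
    # gradient derived above: A*(A(S) - M) + lam*S
    return K2.T @ (K2 @ S @ K1.T - M) @ K1 + lam * S

# spectral norms give a Lipschitz bound for gradF
L = np.linalg.norm(K2, 2)**2 * np.linalg.norm(K1, 2)**2 + lam
t = 1.0 / L

S = np.zeros((p, p))
for _ in range(500):
    S = np.maximum(S - t * gradF(S), 0.0)   # gradient step, then project onto S >= 0
```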

You only need to modify a couple of lines of code to implement an accelerated projected gradient method (such as FISTA), which will probably converge dramatically faster.
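A sketch of the accelerated variant (FISTA with the standard momentum schedule), again with random stand-ins for the data; compared with plain projected gradient, the only changes are the extrapolation point $Y$ and the $\theta$ update:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 16, 40, 8                    # small stand-in sizes
K1 = rng.standard_normal((m, p))
K2 = rng.standard_normal((n, p))
M  = rng.standard_normal((n, m))
lam = 1 / 10

def gradF(S):
    return K2.T @ (K2 @ S @ K1.T - M) @ K1 + lam * S

L = np.linalg.norm(K2, 2)**2 * np.linalg.norm(K1, 2)**2 + lam

S = np.zeros((p, p))
Y = S.copy()
theta = 1.0
for _ in range(200):
    S_next = np.maximum(Y - gradF(Y) / L, 0.0)        # projected gradient step at Y
    theta_next = (1 + np.sqrt(1 + 4 * theta**2)) / 2  # momentum schedule
    Y = S_next + ((theta - 1) / theta_next) * (S_next - S)
    S, theta = S_next, theta_next
```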