How to optimize an approximated matrix multiplication?


[UPDATING]

Here is a solution based on the suggestion of Professor Bangerth. To describe what I am trying to do, first rewrite the question as $$\max_X\|(I-\alpha X)^{-1}X\circ AE\|_F^2$$ where $X$ and $A$ are still $n$ by $n$ square matrices, and $E=(1,\cdots,1)^\top$ is the all-ones vector. Following Professor Bangerth's transformation, the objective becomes $$\max_B\left\|\frac{1}{\alpha}(B-I)\circ AE\right\|_F^2$$ where $B=(I-\alpha X)^{-1}$. The gradient of the objective with respect to $B$ is $$\nabla_B\left(\frac{1}{\alpha^2}\operatorname{Tr}\left[(B-I)\circ(AEE^\top A^\top)\circ(B-I)^\top\right]\right) =\frac{1}{\alpha^2}(B-I)\circ AE(I\circ AE)^\top. $$ By the stochastic gradient descent (SGD) algorithm, $B$ is updated as $$B^{t+1} = B^t + \eta \nabla_B.$$ So, initializing $B$ as $B^0 = (I-\alpha X^0)^{-1}$ for a guess $X^0$, we can keep iterating until convergence.
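The change of variables rests on the identity $(I-\alpha X)^{-1}X=(B-I)/\alpha$, which holds because $X$ commutes with $(I-\alpha X)^{-1}$. A quick numerical sketch of this identity, with a random $X$ chosen purely for illustration:

```python
import numpy as np

# Check the change of variables B = (I - alpha*X)^{-1}: since X commutes
# with (I - alpha*X)^{-1}, we have B X = (B - I)/alpha. X here is just a
# random matrix (an assumption for illustration); the identity holds for
# any X that makes I - alpha*X invertible.
rng = np.random.default_rng(0)
n, alpha = 5, 0.5
X = rng.standard_normal((n, n))
B = np.linalg.inv(np.eye(n) - alpha * X)

lhs = B @ X                    # (I - alpha*X)^{-1} X
rhs = (B - np.eye(n)) / alpha  # (B - I)/alpha
assert np.allclose(lhs, rhs)
```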

The constraint on $B$ comes from the specification of the matrix $X$, which is sparse and each column of which contains exactly one $1$. To update $X$ at each iteration, compute the objective, denoted by $J$, with the $B^t$ from SGD, find the location of the maximum in each column of $$A + \alpha J$$ then mark the same location in $X$ as $1$ and leave the rest zero; call the result $X'$. In Python, this can be done by

import numpy as np

ini = np.argmax(A + alpha * J, axis=0)  # row index of the maximum in each column
X = np.zeros((n, n))
X[ini, np.arange(n)] = 1                # each column of X is a standard basis vector

The updated $B'$ can be obtained from $X'$ as $B'=(I-\alpha X')^{-1}$. We then feed $B'$ back into SGD to get $B^{t+1}$, and the iterations stop when $\|B'-B^t\|<\epsilon$.
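Putting the pieces together, one outer iteration can be sketched as follows. Several details are assumptions of mine, since the text does not pin them down: the SGD-updated $B^t$ is replaced by a small random perturbation of $B^0$ as a stand-in, $J$ is taken to be the scalar objective value, and the Hadamard product with $AE$ is read as scaling row $i$ by the $i$-th row sum of $A$.

```python
import numpy as np

# Sketch of one outer iteration of the scheme above. Assumptions (not
# pinned down by the text): the SGD output B^t is a stand-in perturbation
# of B^0, J is the scalar objective value, and the product with AE scales
# row i of a matrix by the i-th row sum of A.
rng = np.random.default_rng(1)
n, alpha, eps = 5, 0.5, 1e-6
A = rng.standard_normal((n, n))
E = np.ones((n, 1))

X = np.zeros((n, n))                               # initial guess X^0:
X[rng.integers(0, n, size=n), np.arange(n)] = 1.0  # one 1 per column
B = np.linalg.inv(np.eye(n) - alpha * X)           # B^0

B_t = B + 0.01 * rng.standard_normal((n, n))       # stand-in for the SGD result
J = np.linalg.norm(((B_t - np.eye(n)) / alpha) * (A @ E)) ** 2  # objective value

# projection: mark the maximum of each column of A + alpha*J as the 1 in X'
ini = np.argmax(A + alpha * J, axis=0)
X_new = np.zeros((n, n))
X_new[ini, np.arange(n)] = 1.0
B_new = np.linalg.inv(np.eye(n) - alpha * X_new)   # B' = (I - alpha X')^{-1}

converged = np.linalg.norm(B_new - B_t) < eps      # stopping criterion
```

Note that $I-\alpha X'$ is always invertible here: the columns of $X'$ sum to $1$, so the spectral radius of $\alpha X'$ is at most $\alpha<1$.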

I think the whole thing can be implemented, both theoretically and practically. But I still have one pivotal question:

How can the constraint be added to the objective, or expressed as a penalty function?


[OLD]

Suppose the objective I am trying to maximize is $$\max_{X} \|(I - \alpha X)^{-1}XA\|_F$$ where $X$ is the matrix to be determined and $\alpha$ is a scalar. $I-\alpha X$ is invertible only when $\alpha\neq1$; this is proven in the answer by obscurans below. All matrices are $n$ by $n$ square matrices. An extra condition on $X$ is that each column $X(:,k)$, for $k=1,\cdots,n$, is selected from the identity by the optimization. So each $X(:,k)$ is a standard basis vector, and $X$ may have repeated columns.

One possible way to obtain the maximum of the objective is to apply matrix multiplication via random sampling, where the objective is rewritten as $$\max_{p_k}\left\| \frac{1}{n}\sum_{k=1}^n\frac{1}{p_k}B(:,k)X(k,:)A\right\|_F$$ where $B$ is the inverse, approximated by a truncated Neumann series: $$B = (I - \alpha X)^{-1} \approx \sum_{k=0}^n (\alpha X)^k$$
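For the selection matrices considered here (each column a standard basis vector) and $|\alpha|<1$, the spectral radius of $\alpha X$ is at most $\alpha$, so the Neumann series does converge to the inverse. A quick numerical sketch of the truncation:

```python
import numpy as np

# Truncated Neumann series check: for a 0-1 selection matrix X (columns
# are standard basis vectors) and |alpha| < 1, the partial sums of
# (alpha X)^k approach (I - alpha X)^{-1}, since rho(alpha X) <= alpha.
rng = np.random.default_rng(2)
n, alpha, K = 5, 0.5, 60
X = np.zeros((n, n))
X[rng.integers(0, n, size=n), np.arange(n)] = 1.0  # one 1 per column

B_exact = np.linalg.inv(np.eye(n) - alpha * X)
term, B_series = np.eye(n), np.zeros((n, n))
for _ in range(K + 1):       # accumulate sum_{k=0}^{K} (alpha X)^k
    B_series += term
    term = term @ (alpha * X)

assert np.allclose(B_series, B_exact)
```

The truncation error after $K$ terms is bounded by $\alpha^{K+1}/(1-\alpha)$, which is why the check passes comfortably at $K=60$.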

I find this method too complex to implement numerically. Is there a better way to solve this puzzle? Thanks in advance.

There is 1 answer below.


Now that I know what I'm looking for, there's a much easier proof of the below.

A stochastic matrix models a transition step of a Markov chain. Every state must transition into exactly one other state, probabilistically.

What this means is that total probability mass is conserved, and therefore the all-ones vector $[1,1,\ldots,1]^\top$ is always a (left) eigenvector of $X$, with eigenvalue $1$.

But now $I-X$ necessarily has the same eigenvector, of eigenvalue zero, and is never invertible!
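This is easy to confirm numerically. The sketch below builds a random 0-1 selection matrix (column-stochastic, as in the question) and checks both the eigenvalue claim and the singularity of $I-X$:

```python
import numpy as np

# The all-ones vector is a left eigenvector of any column-stochastic
# matrix with eigenvalue 1, so I - X must be singular. Check with a
# random 0-1 selection matrix (each column a standard basis vector):
rng = np.random.default_rng(3)
n = 5
X = np.zeros((n, n))
X[rng.integers(0, n, size=n), np.arange(n)] = 1.0

ones = np.ones(n)
assert np.allclose(ones @ X, ones)               # eigenvalue 1 (left eigenvector)
assert abs(np.linalg.det(np.eye(n) - X)) < 1e-9  # I - X is singular
```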


[Old] At the very least, you can compute $(I-X)^{-1}$ easily with a cycle decomposition. Note that a permutation matrix is one representation of the permutation group, and all of the group theoretic properties of permutations apply.

In particular, for any permutation matrix $X$, it is guaranteed that $X^N=I$ for $N=\mathrm{lcm}(1,2,\ldots,n)$. The reason: trace the orbit of any particular element as you iterate the permutation; its cycle length must be between $1$ and $n$, and hence divides $N$. So after $N$ powers of the permutation, every single element has simultaneously returned to its original spot.
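A quick numerical sketch of this claim, drawing a random permutation of $n=5$ elements, for which $N=\mathrm{lcm}(1,\ldots,5)=60$ (uses `math.lcm`, Python 3.9+):

```python
import numpy as np
from math import lcm                      # Python 3.9+
from numpy.linalg import matrix_power

# X^N = I for any permutation matrix X, with N = lcm(1, 2, ..., n):
# every cycle length divides N, so all elements return simultaneously.
rng = np.random.default_rng(4)
n = 5
N = lcm(*range(1, n + 1))                 # lcm(1,...,5) = 60

perm = rng.permutation(n)
X = np.zeros((n, n), dtype=int)
X[perm, np.arange(n)] = 1                 # column j is e_{perm[j]}

assert N == 60
assert np.array_equal(matrix_power(X, N), np.eye(n, dtype=int))
```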

Now, you can factor the power series as $$\sum_{k=0}^\infty X^k=\left(\sum_{k=0}^\infty X^{Nk}\right)\left(\sum_{j=0}^{N-1}X^j\right)=\left(\sum_{k=0}^\infty I\right)\left(\sum_{j=0}^{N-1}X^j\right)$$ and now there's a rather big problem: $I-X$ is never invertible!


A column-stochastic 0-1 matrix has the same issue. Notice that when multiplying with a power of $X$, a column being a basis vector simply means "take that corresponding column of what I'm multiplying".

Therefore, the generalization is that instead of a permutation, you have a general function $\pi:\{1,\ldots,n\}\rightarrow\{1,\ldots,n\}$.

What doesn't change is that after $N$ (as above) powers of $X$, at least one point must have returned to its original location. Trace the orbits again: multiple points may collapse into the same location, but eventually every orbit lands in a cycle, and the points on a cycle must close it (possibly a cycle of length $1$).
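In matrix terms, the trace of $X^N$ counts the points with $\pi^N(i)=i$, so the argument says $\operatorname{Tr}(X^N)\geq 1$. A quick check with a random (generally non-bijective) $\pi$:

```python
import numpy as np
from math import lcm                      # Python 3.9+
from numpy.linalg import matrix_power

# For a general function pi: {1,...,n} -> {1,...,n} (a 0-1 selection
# matrix X with one 1 per column), at least one point lies on a cycle,
# and every cycle length divides N = lcm(1,...,n) -- so X^N has at
# least one 1 on its diagonal.
rng = np.random.default_rng(5)
n = 5
N = lcm(*range(1, n + 1))

pi = rng.integers(0, n, size=n)           # arbitrary function, not a bijection
X = np.zeros((n, n), dtype=int)
X[pi, np.arange(n)] = 1                   # column j is e_{pi[j]}

assert np.trace(matrix_power(X, N)) >= 1  # some point returns after N steps
```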

$I-X$ remains singular.