Let $A\in \mathbb{R}^{m \times n}$ be a dense matrix and let $x \in \mathbb{R}^n$ be a given vector. How can one solve the following problem, or a relaxation of it, to find a sparse matrix that acts like $A$?
$$ \min_{\tilde{A}} \;\| Ax -\tilde{A}x \|^2 \quad \text{s.t. } \|\tilde{A}\|_0\leq s $$
where $s \in \{1, \dots, mn -1\}$ and $\|\tilde{A}\|_0$ is the number of nonzero entries of $\tilde{A}$.
While describing an optimization algorithm for a relaxed version of this problem, I think I found a simple analytical solution for the original problem.
I'll assume $x \neq 0$. Notice that no row of $\tilde A$ needs more than one non-zero entry: row $i$ affects the objective only through the scalar $a_i^T x$, and a single non-zero entry placed in a column $j$ where $x_j \neq 0$ can already make $a_i^T x$ equal to any desired value.
If $s$ is at least the number of rows $m$ of $\tilde A$, then each row of $\tilde A$ can be given one non-zero entry, and the optimal objective function value will be $0$.
Otherwise, the optimal $\tilde A$ will have $s$ nonzero rows. Which rows of $\tilde A$ should be nonzero? They should be the rows of $\tilde A$ that correspond to the $s$ largest (in magnitude) components of $Ax$. Those components of $Ax$ will be matched exactly, and the optimal objective function value will be the sum of the squares of the remaining components of $Ax$.
Solved!
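As a quick numerical sanity check, here is a sketch of this analytic solution in NumPy (the function name `sparse_surrogate` and the choice to place each row's single non-zero entry at the largest-magnitude component of $x$ are my own):

```python
import numpy as np

def sparse_surrogate(A, x, s):
    """Analytic solution sketch: keep only the s rows of A~ matching
    the s largest-magnitude components of b = A @ x."""
    m, n = A.shape
    b = A @ x
    j = np.argmax(np.abs(x))           # a position where x is nonzero
    assert x[j] != 0, "x must be nonzero"
    A_tilde = np.zeros((m, n))
    rows = np.argsort(np.abs(b))[-s:]  # rows with the s largest |b_i|
    A_tilde[rows, j] = b[rows] / x[j]  # one nonzero per selected row
    return A_tilde

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
x = rng.standard_normal(4)
A_tilde = sparse_surrogate(A, x, s=3)
b = A @ x
resid = b - A_tilde @ x   # the 3 largest components of Ax are matched exactly
print(np.count_nonzero(A_tilde))  # 3
```

The residual norm equals the sum of the squares of the $m - s$ smallest-magnitude components of $Ax$, as claimed above.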
Below is my original solution using the proximal gradient method. You can see at the end where I start to recognize the simpler solution.
A common technique is to penalize the $\ell_1$-norm of $\tilde A$ to encourage $\tilde A$ to be sparse. This approach has the benefit that the $\ell_1$-norm is convex, so the resulting optimization problem is a convex problem.
In more detail, you can solve the optimization problem $$ \tag{1} \text{minimize} \quad \| \tilde Ax - b \|^2 + \lambda \| \tilde A \|_1 $$ where $b = Ax$ and $ \| \tilde A \|_1$ is the sum of the absolute values of the entries of $\tilde A$. The optimization variable is $\tilde A$. Now solve this optimization problem with a method such as the proximal gradient method or an accelerated proximal gradient method. Tune the value of $\lambda > 0$ by trial and error to achieve the desired sparsity level.
One thing that might be a little confusing is that our optimization variable $\tilde A$ is a matrix, whereas typically optimization algorithms are described with the assumption that the optimization variable is a vector. But, we can just reshape $\tilde A$ into a vector, as follows. Let $a_i^T$ be the $i$th row of $\tilde A$, so $$ \tilde A = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix}, $$ and let $a$ be the column vector obtained by vertically concatenating the column vectors $a_1, \ldots, a_m$: $$ a = \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix}. $$ (I have defined $a_i$ so that it is a column vector.) Notice that $$ \tilde A x = \underbrace{\begin{bmatrix} x^T & 0 & \cdots & 0 \\ 0 & x^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x^T \end{bmatrix}}_{X} \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix} = Xa. $$ The optimization problem (1) can now be expressed equivalently as $$ \tag{2} \text{minimize} \quad \| X a - b \|^2 + \lambda \| a \|_1. $$ The optimization variable is the column vector $a$.
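In NumPy terms, the reshaping above is a row-major flatten, and $X$ is the Kronecker product $I_m \otimes x^T$. A quick sketch (all variable names are my own) to verify that $\tilde A x = Xa$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A_tilde = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# a: the rows of A~ stacked into one long vector (blocks a_1, ..., a_m)
a = A_tilde.reshape(-1)

# X: block-diagonal matrix with x^T repeated along the diagonal
X = np.kron(np.eye(m), x)   # shape (m, m*n)

assert np.allclose(A_tilde @ x, X @ a)
```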
Now that we've formulated our optimization problem, the next step is to solve it using the proximal gradient method. The proximal gradient method minimizes functions of the form $f(x) + g(x)$, where $f$ is convex and differentiable (with Lipschitz continuous gradient) and $g$ is convex and "simple" in the sense that its proximal operator can be evaluated efficiently. (Technically we also need to assume that $g$ is lower semi-continuous, which is typically satisfied in practice. We do not require that $g$ is differentiable -- that's what makes the proximal gradient method so useful.)
The proximal operator of $g$ with parameter $t > 0$ is defined as follows: $$ \text{prox}_{t g}(a) = \arg \min_u \, g(u) + \frac{1}{2 t} \|u - a \|_2^2. $$ The parameter $t > 0$ can be viewed as a "step size" which controls how severely we are penalized for moving away from the input vector $a$. Intuitively, the proximal operator of $g$ reduces the value of $g$ as much as possible without straying too far away from $a$.
The proximal gradient method iteration (with step size $t > 0$) is $$ a^{k+1} = \text{prox}_{t g}(a^k - t \nabla f(a^k)) $$ for $k = 0, 1, 2, \ldots$. The vector $a^0$ can be initialized to all zeros. Intuitively, the proximal gradient method repeats the following two steps: first we take a step in the negative gradient direction to reduce the value of $f$, then we evaluate the proximal operator of $g$ to reduce the value of $g$. Convergence is only guaranteed if the step size $t$ satisfies the step size restriction $t \leq \frac{1}{L}$, where $L$ is the Lipschitz constant for the gradient of $f$. There is also a line search version of the proximal gradient method where the value of $t$ is chosen adaptively at each iteration. The line search procedure does not require knowing the value of $L$.
In our case, we'll take $$ f(a) = \| Xa - b \|^2 \quad \text{and} \quad g(a) = \lambda \| a \|_1. $$ The gradient of $f$ is $$ \nabla f(a) = 2X^T (Xa - b). $$ It can be shown that the proximal operator of $g$ (with parameter $t > 0$) simply "shrinks" each component of the input vector $a$ towards the origin by a distance $\lambda t$, stopping if we hit the origin: $\text{prox}_{tg}(a)_i = \text{sign}(a_i)\max(|a_i| - \lambda t, 0)$. This operation is known as soft thresholding.
With the above facts, you're now ready to implement the proximal gradient method to solve problem (2).
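For concreteness, here is a minimal sketch of such an implementation (names like `prox_grad` are my own; the step size uses $L = 2\|X\|_2^2$, the Lipschitz constant of $\nabla f$):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad(X, b, lam, iters=500):
    """Proximal gradient method for min ||X a - b||^2 + lam * ||a||_1."""
    L = 2 * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of grad f
    t = 1.0 / L                          # step size satisfying t <= 1/L
    a = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ a - b)     # gradient step on f
        a = soft_threshold(a - t * grad, lam * t)  # prox step on g
    return a

rng = np.random.default_rng(2)
m, n = 4, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = A @ x
X = np.kron(np.eye(m), x)   # the block matrix from problem (2)
a = prox_grad(X, b, lam=0.5)
A_tilde = a.reshape(m, n)
print(np.count_nonzero(np.abs(a) > 1e-8))  # number of surviving entries
```

Increasing `lam` drives more entries of `a` to exactly zero, which is how the sparsity level is tuned by trial and error.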
Here are some further thoughts which I believe are rather important, but which I'll need to think about more carefully to make sure I'm not making an error.
I think we will run into a problem in that some columns of $X$ are highly correlated with each other. In particular, the first block of columns of $X$ are all scalar multiples of each other, and likewise for the second block of columns of $X$, etc. When columns of $X$ are highly correlated with each other, the $\ell_1$-norm tends not to promote sparsity as much as we would like. That's because the $\ell_1$-norm can't measure a difference between a solution where just one entry of $a_i$ is non-zero and a solution where that entry of $a_i$ is "spread out" so that now all entries of $a_i$ are non-zero while the value of $Xa$ is unchanged.
I think the simplest way to handle this is to notice that only one component of each vector $a_i$ needs to be nonzero. In other words, for any candidate vector $a$ I believe we can find a candidate vector $\hat a$ for which each block $\hat a_i$ has only one non-zero entry, and such that $\hat a$ is just as good as $a$ in the sense that $Xa = X \hat a$.
Thus, I believe we can reduce the size of the problem by assuming that only the first entry (let's say) of each block of $a$ is non-zero. We now have fewer optimization variables, and we can proceed in a similar way.
Strangely enough, in this approach I think the problem becomes fully separable, meaning that the problem reduces to solving a bunch of independent subproblems, one subproblem for each block of $a$. (And for each subproblem, the optimization variable is a scalar. I think we could even derive an analytical formula for the solution to each subproblem.) I need to check my thought process here carefully because it sounds like too much of a simplification.