Solve underdetermined, highly multicollinear, poorly scaled, but sparse in Fourier domain, system to high accuracy

49 Views Asked by At

I wish to solve systems of the form $$ \mathbf{y} = \mathbf{A}\mathbf{x} $$ for $\mathbf{x}$, where $\mathbf{y} \in \mathbb{R}^n$ is known, $\mathbf{A} \in \mathbb{R}^{n\times m}$ is known, $\mathbf{x} \in \mathbb{R}^m$ is unknown, and $n \lt m$ (underdetermined). An example of what a small version of $\mathbf{A}$ looks like is $$ \begin{bmatrix}1&1&0&0\\0&0&1&1\end{bmatrix}.$$ There is thus perfect multicollinearity among many of the variables. Each column will only contain one nonzero value (with value 1), but each row can contain multiple nonzero values. You can think of it as having the effect of summing groups of components of $\mathbf{x}$, and I want to undo that summation. The components of $\mathbf{x}$ can differ from each other by several orders of magnitude. It is possible that the magnitude of $x_0$ is 1000 times that of $x_1$, but we see that our only measurement involving these values is their sum in the first equation of my example $\mathbf{A}$ above. $m$ and $n$ are typically a few million, with $m$ often around $2n$ (but sometimes more). Furthermore, I need to solve this to pretty high accuracy, so I need to approximately recover that $x_1$ even though it was small compared to $x_0$.

This is obviously impossible as it stands, but the potential saviour is that $$ ||\mathbf{B}\left(\mathbf{x^*}\right)||_0 \lt ||\mathbf{B}\left(\mathbf{\tilde{x}}\right)||_0, $$ where $\mathbf{B}(\mathbf{x})$ is an operator that reshapes $\mathbf{x}$ into a tensor and applies a multidimensional Fourier transform to it, $\mathbf{x^*}$ is the true solution, and $\mathbf{\tilde{x}}$ is any solution to the system (my first equation) other than the true solution. In other words, the true solution is the solution to the system with minimum L1 norm in this transformed domain. I also expect many elements of $\mathbf{B}\left(\mathbf{x^*}\right)$ to have negligible amplitude (so negligible that they can be ignored even for recovering $x_1$).

The obvious way to solve this then seems to be to hope that the L0 norm above can be replaced with an L1 norm and use a LASSO method such as ISTA (iterative shrinkage and thresholding algorithm) to minimize $$ ||\mathbf{y} - \mathbf{A}\mathbf{x}||_2^2 + \lambda ||\mathbf{B}\left(\mathbf{x}\right)||_1 $$

(or, equivalently, minimize $||\mathbf{y} - \mathbf{A}\mathbf{B^{-1}}\left(\mathbf{\bar{x}}\right)||_2^2 + \lambda ||\mathbf{\bar{x}}||_1$ with respect to $\mathbf{\bar{x}}$, where $\mathbf{\bar{x}} = \mathbf{B}\left(\mathbf{x}\right)$.)

The results are not as accurate as I need, however. I suspect that this is due to the L1 norm not being an accurate replacement for L0 in my situation, and it might additionally be caused by the bias introduced by the method, or maybe due to the fact that LASSO apparently doesn't cope well with multicollinearity. Iterative hard thresholding (which attempts to minimize L0 rather than L1; it starts with a large threshold, sets everything below the threshold to zero, decreases the threshold, uses previous result as warm start, repeat) also does not always yield satisfactory results, possibly due to its apparently high variance.

Another possibility that I have considered is to exploit the fact that we expect $\mathbf{\bar{x}}$ to be somewhat "clumped", with high amplitude magnitudes ($\mathbf{\bar{x}}$ is complex) generally occurring in clusters. I might be able to roughly estimate the reciprocals of these magnitudes as a tensor $\mathbf{M}$ and could then minimize $$ ||\mathbf{y} - \mathbf{A}\mathbf{B^{-1}}\left(\mathbf{\bar{x}}\right)||_2^2 + \lambda_1 ||\mathbf{\bar{x}}||_1 + \lambda_2 ||\mathbf{M} \odot \mathbf{\bar{x}}||_2^2, $$ where $\odot$ is pointwise multiplication. This has the effect of scaling $\mathbf{\bar{x}}$ in the final term so that each component has a standard deviation close(r) to 1. The potential benefit of this would be to make it possible to use Elastic Net without severely underestimating all of the high amplitude components (which would happen without $\mathbf{M}$). Since I don't care about whether the small values get set to exactly zero, I could even drop the L1 term and just use ridge regression. The accuracy of this will depend on how accurately I can estimate $\mathbf{M}$, though, which isn't ideal.

Are there alternative methods that are more suitable for this type of problem? Non-negative garrotte and principal component regression sound like they might be relevant. Given the size of the problem, computational cost is a consideration.