I want to solve the classical projection problem: given $x_0: n\times 1$ and a sparse matrix $C: m \times n$, solve $$ x = \operatorname{argmin}\|x-x_0\|^2 \quad \text{s.t.} \quad Cx = 0. $$ The three usual ways to do this require either:
1. Sparse QR of $C^T$, or
2. Sparse $LDL^T$ of $C\cdot C^T$, or
3. Solving the KKT system with Lagrange multipliers.
The problem I am facing is that $m\gg n$, that both $n$ and $m$ can be quite large, and that $C$ is rank-deficient, with a big (right) null space. Therefore, (2) fails since $C\cdot C^T$ is not invertible (and huge besides), and (3) fails because $C$ has non-independent rows, leaving (1), which is alas very expensive for a matrix this size. Is there a gradient-descent way to do this, or some direct method with alternative decompositions?
Expanding on my comment: Let $$\widehat{x} = \text{argmin}\|x-x_0\|_2^2 \quad \text{s.t.} \quad Cx = 0, \quad (1)$$ and for $\rho > 0$, let $$\widehat{x}_{\rho} = \text{argmin}\|x-x_0\|_2^2 + \rho\|Cx\|_2^2. \quad (2)$$
To solve $(2)$, we can use gradient descent. Initialize $x^{(0)}_{\rho}$ to some guess, and iterate $$x^{(k+1)}_{\rho} = x^{(k)}_{\rho} - \gamma\left(2(x^{(k)}_{\rho}-x_0)+2\rho C^TCx^{(k)}_{\rho}\right).$$ Note that each iteration requires multiplying a vector by $C$, multiplying the result by $C^T$, and then a few vector operations; if $C$ is sparse, each iteration should be fairly quick. The gradient is Lipschitz with constant $2(1+\rho\|C\|_2^2)$, so for any fixed stepsize $0 < \gamma < 1/(1+\rho\|C\|_2^2)$ the iterates $x^{(k)}_{\rho}$ converge to $\widehat{x}_{\rho}$, and reasonably fast as long as $\rho$ is not too large.
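Here is a minimal sketch of these iterations in Python with NumPy/SciPy. The matrix sizes, the duplicated-column construction (used only to give $C$ a nontrivial right null space, matching the question's setting), and the stepsize rule are all illustrative assumptions:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
m, r = 200, 25
# Duplicating columns makes C rank-deficient with a nontrivial (right) null space.
B = sp.random(m, r, density=0.05, random_state=rng, format="csr")
C = sp.hstack([B, B], format="csr")
n = 2 * r
x0 = rng.standard_normal(n)
rho = 10.0

# Gradient of (2): 2(x - x0) + 2*rho*C^T C x, Lipschitz with constant
# 2*(1 + rho*sigma_max(C)^2); any gamma below half its reciprocal is safe.
sigma_max = svds(C, k=1, return_singular_vectors=False)[0]
gamma = 1.0 / (2.0 * (1.0 + rho * sigma_max**2))

x = np.zeros(n)
for _ in range(5000):
    grad = 2.0 * (x - x0) + 2.0 * rho * (C.T @ (C @ x))
    x_next = x - gamma * grad
    if np.linalg.norm(x_next - x) < 1e-12:  # stop when the step stalls
        x = x_next
        break
    x = x_next
```

Each iteration costs two sparse matrix–vector products plus $O(n)$ vector work, so the per-iteration cost scales with the number of nonzeros of $C$.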
I believe it can be shown that $\displaystyle\lim_{\rho \to \infty}\widehat{x}_{\rho} = \widehat{x}$, i.e. the solution to problem $(2)$ converges to the solution to problem $(1)$ as $\rho \to \infty$. So for large enough $\rho$, solving $(2)$ should give a good approximation to the solution to $(1)$.
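This convergence is easy to check numerically on a small dense example (the sizes below are assumptions; the closed form comes from setting the gradient of $(2)$ to zero, which gives $(I + \rho C^TC)\,\widehat{x}_{\rho} = x_0$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6  # m < n here, so C automatically has a nontrivial null space
C = rng.standard_normal((m, n))
x0 = rng.standard_normal(n)

# Exact solution of (1): project x0 onto null(C), using a null-space basis
# V0 taken from the trailing right singular vectors of C.
_, _, Vt = np.linalg.svd(C)
V0 = Vt[np.linalg.matrix_rank(C):].T
x_hat = V0 @ (V0.T @ x0)

# Solution of (2) in closed form: (I + rho*C^T C) x_rho = x0.
for rho in [1e0, 1e2, 1e4, 1e6]:
    x_rho = np.linalg.solve(np.eye(n) + rho * (C.T @ C), x0)
    print(rho, np.linalg.norm(x_rho - x_hat))  # error shrinks like O(1/rho)
```

Along each right singular direction $v_i$ of $C$ with $\sigma_i > 0$, the penalized solution scales the component of $x_0$ by $1/(1+\rho\sigma_i^2)$, which is where the $O(1/\rho)$ rate comes from.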
One method to solve $(1)$ is to pick an increasing sequence of values $\rho_1 < \rho_2 < \ldots$, and for each $\rho_i$, use gradient descent to find $\widehat{x}_{\rho_i}$ with $x^{(0)}_{\rho_i} = \widehat{x}_{\rho_{i-1}}$ as the initial guess (i.e. use the solution for the previous value of $\rho$ to initialize gradient descent for the current value of $\rho$). Since $\widehat{x}_{\rho_{i-1}}$ is already close to $\widehat{x}_{\rho_i}$, each of these gradient descents should need relatively few iterations.
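A sketch of this continuation scheme (the $\rho$ schedule, iteration counts, and test matrix are assumptions, and `solve_penalized` is a hypothetical helper wrapping the gradient descent iteration for problem $(2)$):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
m, r = 200, 25
# Duplicated columns give C a nontrivial null space, matching the question.
B = sp.random(m, r, density=0.05, random_state=rng, format="csr")
C = sp.hstack([B, B], format="csr")
n = 2 * r
x0 = rng.standard_normal(n)
sigma_max = svds(C, k=1, return_singular_vectors=False)[0]

def solve_penalized(rho, x, iters=3000):
    """Gradient descent on (2) from warm start x, with a safe fixed stepsize."""
    gamma = 1.0 / (2.0 * (1.0 + rho * sigma_max**2))
    for _ in range(iters):
        x = x - gamma * (2.0 * (x - x0) + 2.0 * rho * (C.T @ (C @ x)))
    return x

x = x0.copy()
for rho in [1.0, 10.0, 100.0, 1000.0]:
    x = solve_penalized(rho, x)  # warm-start from the previous rho's solution

# The constraint violation ||Cx|| shrinks relative to ||C x0|| as rho grows.
print(np.linalg.norm(C @ x) / np.linalg.norm(C @ x0))
```

The trade-off is that larger $\rho$ enforces $Cx = 0$ more tightly but worsens the conditioning of $(2)$, which is exactly why warm-starting across the $\rho$ schedule helps.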