I need to compute the pseudo-inverse of a very large rectangular dense matrix without any special structure or properties. I run out of memory/computing power and have no access to a large parallel computing resource.
However, the good news is that I need only one column of the result at a time (for the subsequent calculations).
Is there any iterative algorithm that can compute the 'k'th column (or at least progressively build up the first 'k' columns) of the pseudo-inverse? I'd appreciate any inputs/thoughts on this.
PS: I am using MATLAB for now, but the programming environment does not really matter.
I'm not sure this is exactly what you need, but since you also asked whether one can "at least progressively build up the first 'k' columns of the pseudo-inverse": yes, this is possible in a sense: we can build an algorithm whose $k$-th step gives us the pseudo-inverse of the first $k$ columns of the given matrix. Not sure if this helps you. In case it does, here it goes:
The idea is to follow an iterative, update-like procedure. Let us denote by $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_K] \in \mathbb{C}^{M \times K}$ the matrix you want to pseudo-invert, where $M\gg K$, since it is rectangular, as you say. Begin with the first column $\mathbf{a}_1$: its pseudo-inverse is nothing but $\mathbf{b}_1 = \frac{\mathbf{a}_1^{\rm H}}{\mathbf{a}_1^{\rm H} \mathbf{a}_1}$, which is easily verified by $\mathbf{b}_1 \cdot \mathbf{a}_1 = \mathbf{I}_1 = 1$. Moreover, if we have a pseudo-inverse of the first $k$ columns, say $\mathbf{B}_k \in \mathbb{C}^{k\times M}$ with $\mathbf{B}_k \cdot \mathbf{A}_k = \mathbf{I}_k$, we can find $\mathbf{B}_{k+1} = \left[\mathbf{A}_k, \mathbf{a}_{k+1}\right]^+$ by expanding the pseudo-inverse using rank-one update rules.
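As a small numerical sanity check of that first step (a NumPy sketch, since the environment doesn't matter; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 6
a1 = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # first column of A

# Pseudo-inverse of a single column: b1 = a1^H / (a1^H a1)
b1 = a1.conj() / np.vdot(a1, a1)

# b1 is a row vector satisfying b1 @ a1 = 1, and it matches pinv of the column
print(np.allclose(b1 @ a1, 1.0))                      # True
print(np.allclose(b1, np.linalg.pinv(a1[:, None])))   # True
```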
The resulting update step can be written like this: denote by $\mathbf{c} = \mathbf{a}_{k+1} - \mathbf{A}_k\cdot\mathbf{B}_k\cdot\mathbf{a}_{k+1}$ the residual of the new column and by $\mathbf{b}_{k+1} = \frac{\mathbf{c}^{\rm H}}{\mathbf{c}^{\rm H}\mathbf{c}}$ its pseudo-inverse (just like for $\mathbf{a}_1$); then
$$\mathbf{B}_{k+1} = \begin{bmatrix} \mathbf{B}_k - \mathbf{B}_k\cdot\mathbf{a}_{k+1}\cdot\mathbf{b}_{k+1} \\ \mathbf{b}_{k+1} \end{bmatrix}.$$
This requires $\mathbf{c}\neq\mathbf{0}$, i.e., the new column must not lie in the span of $\mathbf{A}_k$ (which holds whenever $\mathbf{A}$ has full column rank).
It looks a bit like Gram–Schmidt: in each step, the new column is projected onto the span of the columns we already have (through $\mathbf{A}_k\cdot\mathbf{B}_k\cdot\mathbf{a}_{k+1}$), and this projection is subtracted.
What is the complexity? For each new column in $\mathbf{A}$, we need

- $k\cdot M$ multiplications for the product $\mathbf{B}_k\cdot\mathbf{a}_{k+1}$,
- $k\cdot M$ multiplications for $\mathbf{A}_k\cdot(\mathbf{B}_k\cdot\mathbf{a}_{k+1})$,
- about $M$ multiplications to normalize the residual $\mathbf{a}_{k+1} - \mathbf{A}_k\cdot\mathbf{B}_k\cdot\mathbf{a}_{k+1}$,
- $k\cdot M$ multiplications for the rank-one (outer product) update of $\mathbf{B}_k$.
So it is something like $(3k+1)\cdot M$ multiplications in step $k$, which summed over all $K$ steps gives roughly $\frac{3}{2}K^2 M$. Memory-wise, it's mostly the memory for $\mathbf{B}_k$, which is $M\cdot k$ variables.
For comparison, the standard solution $\mathbf{A}^+ = (\mathbf{A}^{\rm H}\mathbf{A})^{-1}\cdot\mathbf{A}^{\rm H}$ requires $K^2M$ multiplications to get $\mathbf{A}^{\rm H}\mathbf{A}$, something on the order of $K^3$ multiplications to invert it, and another $M K^2$ multiplications to get the final result.
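In NumPy terms, that direct route looks like this (a sketch; using `solve` on the normal equations rather than forming the inverse explicitly — in practice an SVD- or QR-based routine is numerically safer):

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 50, 8
A = rng.standard_normal((M, K))

# A^+ = (A^H A)^{-1} A^H for a full-column-rank A;
# solving the K x K system avoids the explicit inverse
A_pinv = np.linalg.solve(A.conj().T @ A, A.conj().T)

print(np.allclose(A_pinv, np.linalg.pinv(A)))  # True
```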
The iterative scheme is particularly useful if we need all the $\mathbf{B}_k$ for $k=1, 2, \ldots, K$, which happens in certain iterative least squares algorithms. Also, some steps, such as the outer product, are perfectly parallelizable and can be done very efficiently.
Here is a quick Matlab demo script which computes all $K$ columns. You can terminate the loop earlier; this will give you a pseudo-inverse of $\mathbf{A}_k$, the first $k$ columns of $\mathbf{A}$ (not the first $k$ columns of the pseudo-inverse of the whole $\mathbf{A}$).
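A NumPy rendition of the same loop, in case that is easier to experiment with (a sketch assuming $\mathbf{A}$ has full column rank; function and variable names are mine):

```python
import numpy as np

def pinv_by_columns(A):
    """Build the pseudo-inverse of A column by column.

    Assumes A has full column rank. After step k, B is the
    pseudo-inverse of the first k columns of A.
    """
    M, K = A.shape
    a1 = A[:, 0]
    B = (a1.conj() / np.vdot(a1, a1))[None, :]   # pinv of the first column
    for k in range(1, K):
        a = A[:, k]                              # new column a_{k+1}
        d = B @ a                                # B_k a_{k+1}
        c = a - A[:, :k] @ d                     # residual: projection subtracted
        b = (c.conj() / np.vdot(c, c))[None, :]  # pinv of the residual
        B = np.vstack([B - np.outer(d, b), b])   # rank-one update, append new row
    return B

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 6)) + 1j * rng.standard_normal((40, 6))
B = pinv_by_columns(A)
print(np.allclose(B, np.linalg.pinv(A)))  # True
```

Terminating the loop after $k$ iterations returns exactly the $\mathbf{B}_k$ described above.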