I am trying to obtain the Moore-Penrose pseudoinverse (MPI) of a very large, sparse, rank-deficient, singular, square matrix (75000x75000, nearly full rank). I realize the pseudoinverse will be very dense. I have tried some brute-force approaches by converting it to a dense matrix (~25 GB) on a large-memory machine (124 GB). Both the least-squares route (scipy.linalg.pinv) and some SVD approaches (numpy.linalg.pinv) quickly exhaust the memory. The only approach that has worked is a slow but memory-efficient SVD (scipy.linalg.pinv2). I am exploring what other techniques can balance memory efficiency and speed, although I have yet to try scipy.sparse.linalg.inv or scipy.sparse.linalg.svds for a thorough comparison. I don't think iterative approaches will work well because I can't provide a decent starting point, which I think leaves QR as a prime candidate. I have been exploring some of the methods listed in this paper. I have the following questions:
- I want to sample selected entries of the MPI to calculate distances between a large number of graph nodes in an ad-hoc fashion. Is there a better way to sample selected elements of the MPI than persisting the whole matrix?
- Are there any algorithms particularly well suited to my matrix type?
- Is QR factorization my best bet? Are there other approaches I should be considering?
- If QR, what memory-efficient QR implementations exist? Or will I need to hand-roll one from a paper?
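For context, here is the kind of sampling I have in mind: with (truncated) SVD factors of A in hand, any single entry of the MPI can be read off without ever materializing the dense 75000x75000 result. A minimal sketch (the function name `pinv_entry` and the `rcond` cutoff are my own placeholders):

```python
import numpy as np

def pinv_entry(U, s, Vt, i, j, rcond=1e-12):
    """Single entry (i, j) of the Moore-Penrose pseudoinverse.

    Given the SVD A = U @ diag(s) @ Vt, the pseudoinverse is
    A+ = V @ diag(1/s) @ U.T, so
        (A+)[i, j] = sum_k Vt[k, i] * (1/s[k]) * U[j, k].
    """
    keep = s > rcond * s.max()          # discard tiny singular values
    return np.dot(Vt[keep, i] / s[keep], U[j, keep])
```

Only the factors (two tall matrices plus the singular values) need to be persisted, which is far smaller than the dense pseudoinverse when a truncated SVD suffices.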
There is an algorithm called the randomized SVD; Facebook released research on it (and the fbpca library) years ago. Do not turn your matrix into a dense matrix: there are randomized-SVD implementations that work directly on the CSR structure.
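For example, a rough sketch using scikit-learn's `randomized_svd`, which accepts sparse input (`fbpca.pca` has a similar interface); the matrix size, density, and target rank `k` below are arbitrary placeholders:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.extmath import randomized_svd

# Stand-in for the real problem: a random sparse CSR matrix.
A = sp.random(2000, 2000, density=0.001, format="csr", random_state=0)

k = 50  # target rank, problem dependent
U, s, Vt = randomized_svd(A, n_components=k, random_state=0)

# A+ ~= V @ diag(1/s) @ U.T, kept in factored form (never densified).
pinv_factors = (Vt.T, 1.0 / s, U.T)
```

The matrix is never densified; the randomized algorithm only needs matrix-vector products, which CSR supports efficiently.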
It is also a function of the degree of sparsity: matrices aren't simply either sparse or not sparse, it is a percentage. If you look at scipy's function for constructing a random sparse matrix, it has a parameter called density.
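For instance, with `scipy.sparse.random` the density is given as a fraction of nonzero entries:

```python
import scipy.sparse as sp

# density=0.01 means roughly 1% of the 1000*1000 entries are stored.
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
print(A.nnz)  # about 10000 stored entries
```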