I have a matrix of dimension $2500 \times 50000$ with (supposedly) i.i.d. standard normal entries. When I compute the singular values of this matrix, the smallest ones come out very close to 0 (in fact, several hundred of them are close to 0) and the largest singular value is O($10^3$). This is completely at odds with what I expect from Marchenko-Pastur (in fact, since P/N = 20, I expect the distribution to be rather tight).
Could someone please tell me whether (a) I should revisit my calculations (which I will do in any case) because this sort of thing cannot happen, or (b) small deviations from the Wishart assumptions can cause huge changes in the lower singular values?
**************************** EDIT ********************
The more general problem: I have a genotype matrix. Rows are the 'supposedly' unrelated people (in reality there is always some relationship; people may be second cousins without knowing it) and columns are the SNPs (some tests could be done to justify the supposed independence). Allele frequencies are estimated from the column sums. Then all entries in the matrix are centered and scaled using binomial means and standard errors (with the allele frequency as the parameter). This centered and scaled matrix will behave like a Wishart-like matrix as far as the largest eigenvalue goes, so Tracy-Widom theory can be used (Soshnikov has shown this to be true so long as the first four moments are in about the same ballpark: http://arxiv.org/pdf/math/0104113v2.pdf). The question is: can I say something about the smallest singular value? I would expect the smaller singular values to be generally more sensitive to perturbations, but (as Dan points out) as long as the moments of the distribution are in the correct ballpark, Marchenko-Pastur is pretty robust. I am wondering what makes it crack?
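For concreteness, here is a minimal sketch of the normalization described above. It assumes diploid genotypes coded as 0/1/2 copies of an allele, so that each entry is treated as Binomial(2, p) with mean 2p and standard deviation sqrt(2p(1-p)); that coding, the function name `standardize_genotypes`, and the simulated data are illustrative assumptions, not part of the actual pipeline.

```python
import numpy as np

def standardize_genotypes(G):
    """Center and scale a 0/1/2 genotype matrix column by column.

    Assumes rows are people, columns are SNPs, entries ~ Binomial(2, p_j).
    """
    G = np.asarray(G, dtype=float)
    n_people = G.shape[0]
    # Allele frequency estimated from the column sums (2 alleles per person).
    p_hat = G.sum(axis=0) / (2.0 * n_people)
    # Monomorphic SNPs have zero estimated variance and cannot be scaled.
    keep = (p_hat > 0.0) & (p_hat < 1.0)
    p_hat = p_hat[keep]
    mean = 2.0 * p_hat                        # binomial mean
    sd = np.sqrt(2.0 * p_hat * (1.0 - p_hat))  # binomial standard deviation
    return (G[:, keep] - mean) / sd, keep

# Simulated stand-in data: 100 people, 500 independent SNPs.
rng = np.random.default_rng(1)
freqs = rng.uniform(0.1, 0.9, size=500)
G = rng.binomial(2, freqs, size=(100, 500))

X, keep = standardize_genotypes(G)
sv = np.linalg.svd(X, compute_uv=False)
print("shape:", X.shape, "smallest / largest singular value:", sv.min(), sv.max())
```

One thing this sketch makes visible: because the frequencies are estimated from the same columns they are used to center, every column of `X` sums to zero, so `X` has at least one singular value that is zero up to rounding, regardless of the distribution of the entries.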
My best guess is that there is some issue with the calculations. Are you using Gaussian random variables with zero mean and unit variance? A single large eigenvalue can result from having a non-zero mean (i.e. using `rand()` instead of `randn()`). The zeros are more perplexing; you're correct to assume that all singular values will be pretty large. I guess you would get a lot of zeros if you took `eig(M'M)` rather than `eig(MM')`, in which case you would just consider the non-zero ones.

For a matrix this large I would not expect deviations from the Wishart assumptions to be responsible for the behavior you observed. To convince you of this fact, I did a quick calculation of the SVD of this matrix using Gaussian and uniform random variables with zero mean and unit variance. You can see from the image below that the distributions of singular values are almost indistinguishable.
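A rough way to reproduce these checks is sketched below in Python/NumPy rather than with the MATLAB-style calls mentioned above. It is not the calculation behind the image: the matrix is scaled down to 100 × 2000 (an assumption made to keep it fast, with the same aspect ratio of 20), and the seed and variable names are arbitrary. It compares Gaussian and uniform entries against the approximate edges $\sqrt{P} \pm \sqrt{N}$, shows the single large singular value produced by a non-zero mean, and counts the zeros that appear when the larger Gram matrix is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000  # scaled-down stand-in for 2500 x 50000 (p / n = 20)

# Zero-mean, unit-variance entries from two different distributions.
gauss = rng.standard_normal((n, p))
unif = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, p))  # variance 1

# Uniform on [0, 1): mean 0.5, the analogue of rand() in place of randn().
shifted = rng.uniform(0.0, 1.0, size=(n, p))

sv_gauss = np.linalg.svd(gauss, compute_uv=False)
sv_unif = np.linalg.svd(unif, compute_uv=False)
sv_shifted = np.linalg.svd(shifted, compute_uv=False)  # sorted descending

# Approximate edges of the singular value spectrum for unit-variance entries.
lower, upper = np.sqrt(p) - np.sqrt(n), np.sqrt(p) + np.sqrt(n)
print("expected range:", lower, upper)
print("gaussian      :", sv_gauss.min(), sv_gauss.max())
print("uniform       :", sv_unif.min(), sv_unif.max())
print("non-zero mean : top two singular values", sv_shifted[0], sv_shifted[1])

# The p x p Gram matrix has rank at most n, so p - n eigenvalues are zero
# up to rounding; the n x n Gram matrix has none.
eig_big = np.linalg.eigvalsh(gauss.T @ gauss)
n_zero = int(np.sum(eig_big < 1e-6 * eig_big.max()))
print("numerically zero eigenvalues of M'M:", n_zero, "out of", p)
```

If a real computation shows hundreds of near-zero singular values for the full matrix, comparing against a small run like this one is a quick way to tell a bookkeeping issue (wrong Gram matrix, non-zero mean) apart from a genuine property of the data.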