Estimating a large covariance matrix from few samples.


Let $\mathbf{x}$ be a random vector in $\Bbb{R}^n$, such that $\mathbf{x}\sim N(\bar{\mathbf{x}}, \Sigma)$. $N$ observations of $\mathbf{x}$ are available, say $\{\mathbf{x}_i, i=1,\ldots,N\}$.

The mean vector $\bar{\mathbf{x}}$ is given (there is no need to estimate it from the samples). We want to estimate the covariance matrix $\Sigma$. The standard sample estimator is $$ \hat{\Sigma} = \frac{1}{N-1}\sum_{i=1}^N (\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^\top, $$ but what happens when $N\ll n$? For instance, let the dimensionality of $\mathbf{x}$ be $n=1000$ and suppose that we have only $N=10$ observations. I think that, because of the so-called "curse of dimensionality", we cannot obtain a good estimate of $\Sigma$: the sample covariance has rank at most $N$, so it is singular whenever $N<n$. Is that true?
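To illustrate the problem, here is a small numpy experiment (the zero mean and identity covariance are just placeholders) showing that the sample covariance is rank-deficient in this regime:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 1000, 10                    # dimension and sample size from the question
mean = np.zeros(n)                 # the known mean (placeholder: zero)
Sigma_true = np.eye(n)             # placeholder true covariance

X = rng.multivariate_normal(mean, Sigma_true, size=N)   # shape (N, n)
S = (X - mean).T @ (X - mean) / (N - 1)                 # sample covariance

# With N < n the estimate has rank at most N, hence it is singular.
print(np.linalg.matrix_rank(S))    # at most 10, far below n = 1000
```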

Finally, could we obtain a better estimate given that $\Sigma=\sigma I_n$? That is, does it help if we only need to estimate a single parameter ($\sigma$) rather than the full covariance matrix?

EDIT I: What I have tried so far is to find the $\sigma^\star$ that minimizes the squared Frobenius norm of $\hat{\Sigma}-\sigma I_n$: $$ \sigma^\star = \underset{\sigma}{\mathrm{argmin}} \left\| \hat{\Sigma} - \sigma I_n \right\|_F^2, $$ which gives $$ \sigma^\star = \frac{1}{n}\sum_{i=1}^{n}\hat{\sigma}_{ii}, $$ i.e., the average of the diagonal entries of $\hat{\Sigma}$. Does this make sense? Thanks a lot!
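A minimal check of this closed form (the matrix `S` below is a placeholder):

```python
import numpy as np

def sigma_star(S):
    """Minimizer of ||S - sigma * I||_F^2 over sigma: the mean diagonal
    entry, i.e. trace(S) / n."""
    return np.trace(S) / S.shape[0]

S = np.array([[2.0, 0.3],          # placeholder 2x2 sample covariance
              [0.3, 1.0]])
print(sigma_star(S))               # 1.5 = (2.0 + 1.0) / 2
```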

EDIT II: The above approach seems to produce meaningful covariance matrices (are they truly meaningful?), but it does not work well in practice.

I would like to discuss a procedure that would give acceptable estimates of the covariance matrix in this regime. Does anyone have experience with this kind of estimation/modeling problem? Thanks a lot!
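For instance, would linear shrinkage toward a scaled identity, in the spirit of Ledoit-Wolf, be the right direction? A minimal sketch using scikit-learn's `LedoitWolf` (the data below are placeholders; `assume_centered=True` reflects that the known mean has already been subtracted):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
n, N = 1000, 10
X = rng.standard_normal((N, n))    # placeholder data with known zero mean

# Shrink the sample covariance S toward a scaled identity:
#   Sigma_hat = (1 - rho) * S + rho * sigma * I,
# with rho chosen to minimize the expected squared Frobenius error.
lw = LedoitWolf(assume_centered=True).fit(X)
Sigma_hat = lw.covariance_         # well-conditioned even though N << n
print(lw.shrinkage_)               # the estimated shrinkage weight rho
```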


BEST ANSWER

The sample covariance matrix estimator

$$\Theta(X) = \frac{1}{N}\sum_{i=1}^N (\mathbf{X}_i-\bar{\mathbf{X}})(\mathbf{X}_i-\bar{\mathbf{X}})^\top$$

is the maximum likelihood estimator of the covariance matrix of a multivariate Gaussian distribution. However, when $\bar{\mathbf{X}}$ is the sample mean, it is biased: $$E[\Theta(X)]=\frac{N-1}{N}\Sigma.$$ Correcting for the bias, we get the unbiased estimator $$\frac{N}{N-1}\Theta(X)=\frac{1}{N-1}\sum_{i=1}^N (\mathbf{X}_i-\bar{\mathbf{X}})(\mathbf{X}_i-\bar{\mathbf{X}})^\top.$$ By adjusting the scaling factor, the estimator's variance can be lowered, but the resulting estimator is then no longer unbiased. Among all unbiased estimators, the one (if it exists) with the lowest possible variance achieves the Cramér-Rao lower bound; in general, every unbiased estimator has variance at least as large as this bound.
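A quick Monte Carlo sketch illustrates the $(N-1)/N$ factor (the true $\Sigma$ below is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, trials = 3, 10, 20000        # small n so the average converges quickly
Sigma = np.diag([1.0, 2.0, 3.0])   # arbitrary true covariance (assumption)

est = np.zeros((n, n))
for _ in range(trials):
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    Xc = X - X.mean(axis=0)        # centered with the *sample* mean
    est += Xc.T @ Xc / N           # the biased MLE Theta(X)
est /= trials

print(est)                         # approx (N-1)/N * Sigma = 0.9 * Sigma
```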

It is known that the maximum likelihood estimator achieves the Cramér-Rao lower bound at least asymptotically, i.e., for $N$ large enough. There are many cases where the maximum likelihood estimator cannot achieve the Cramér-Rao lower bound for finite $N$; in such cases there might, in theory, be another estimator that achieves the bound, but I personally don't know of such a pathological case. From this point of view, often the best we can do is the maximum likelihood estimator, which you already have. For small sample sizes, one can consider resampling methods, for example bootstrapping.
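A minimal sketch of the bootstrap idea (the helper name and data here are illustrative): resample the rows with replacement and inspect the spread of the resulting estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_cov(X, B=1000):
    """Nonparametric bootstrap of the sample covariance: resample the
    N rows of X with replacement B times and re-estimate each time."""
    N = X.shape[0]
    covs = []
    for _ in range(B):
        Xb = X[rng.integers(0, N, size=N)]     # resampled data set
        Xb = Xb - Xb.mean(axis=0)
        covs.append(Xb.T @ Xb / (N - 1))
    return np.stack(covs)                      # shape (B, n, n)

X = rng.standard_normal((10, 5))               # illustrative data, N=10, n=5
covs = bootstrap_cov(X)
print(covs.std(axis=0))                        # elementwise bootstrap std errors
```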

About the second part of your question: if the covariance matrix is diagonal, the covariance is zero for every pair $(X_i,X_j)$, $i\neq j$. For jointly Gaussian random variables $X_i$, $i=1,\ldots,n$, uncorrelatedness implies independence, so every $X_i$ has a marginal Gaussian distribution with mean $\mu_i$ and variance $\sigma_i^2$. Since the mean $\mu_i$ is known, the maximum likelihood estimator of the variance of $X_i$, $$\hat{\sigma}^2_i=\frac{1}{N}\sum_{k=1}^{N}\left(x_i^{(k)}-\mu_i\right)^2,$$ where $x_i^{(k)}$ denotes the $i$-th coordinate of the $k$-th sample, is already unbiased. For every coordinate $X_i$ there are still only $N=10$ samples, exactly as in the non-diagonal scenario, so reducing the problem to a diagonal covariance does not by itself bring anything extra. However, if $\sigma_{ii}=\sigma$ for all $i$, the effective number of samples grows to $nN=10^4$, from which one can obtain a very good estimate.
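A sketch of this pooled estimate under $\Sigma=\sigma I_n$ (the true $\sigma$ and the zero mean are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 1000, 10
sigma_true = 2.5                        # illustrative true variance (assumption)
mu = np.zeros(n)                        # the known mean

X = mu + np.sqrt(sigma_true) * rng.standard_normal((N, n))

# Under Sigma = sigma * I_n with a known mean, all n*N scalar deviations
# x_i^(k) - mu_i are i.i.d. N(0, sigma), so they can be pooled:
sigma_hat = np.sum((X - mu) ** 2) / (n * N)
print(sigma_hat)                        # close to 2.5, since n*N = 10^4
```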