I'm interested in solving the following problem:
Given a dataset $\mathcal{X} = \{x_i\}_{i=1}^n\subset\mathbb{R}^p$, find a smaller set $\mathcal{Y} = \{y_i\}_{i=1}^k\subset{\mathbb{R}^p}$ whose sample covariance matrix, $C_\mathcal{Y}$, is a good approximation to that of $\mathcal{X}$, $C_{\mathcal{X}}$.
To make this a little more precise, is there a good way of finding $\mathcal{Y}$ such that $\left\|C_\mathcal{X} - \mathcal{C}_\mathcal{Y}\right\| \le \varepsilon$, or such that $\left\|C_{\mathcal{X}} - C_{\mathcal{Y}}\right\|$ is near-minimal for fixed $k$?
One idea would be to choose $\mathcal{Y}$ by sampling $\mathcal{X}$ at random. Perhaps a better way to do this, though, is to make $\mathcal{Y}$ the result of Lloyd's algorithm/$k$-means on $\mathcal{X}$, which would help to better summarize the entire dataset.
In general, is there a decent way to solve this problem, preferably with good theoretical bounds on $\left\|C_{\mathcal{X}} - C_{\mathcal{Y}}\right\|$ (where $\left\|\cdot\right\|$ is just a common matrix norm, e.g. Frobenius or spectral)?
Ideally I could find $\mathcal{Y}$ in time complexity similar to that of Lloyd's algorithm, or at least in time that is sub-quadratic in $n$ (so preferably cheaper than doing something like PCA). But if you have ideas for methods/algorithms that are more computationally expensive that haven't been mentioned here I'd also appreciate hearing those.
Thank you!
Here is an approach which produces a sample $\mathcal{Y}$ such that $C_\mathcal{Y} = C_\mathcal{X}$.
Let $y_{ij}$ be the $j$ coordinate of event $i$, where $i=1,2,...,k$ and $j=1,2,...,p$. Require $k$ to be even.
Choose $k=2m$ to be large enough, so that $m$ events have at least the same number of degrees of freedom as $C_\mathcal{X}$. In other words $$ m\,p \ge \frac{1}{2}\,p\,(p+1) \ \ \ \rightarrow \ \ \ m \ge \frac{1}{2}\,(p+1)$$
The sample consists of pairs of events, $\vec{y}_{i+m} = - \vec{y}_i$ to ensure that the sample mean is $0$, to simplify calculations.
Solve the set of quadratic equations for the event coordinates:
$$\eqalign{ {C_\mathcal{X}}_{jj}=\frac{2}{k}\sum_{i=1}^{k/2}y_{ij}^2 \\ {C_\mathcal{X}}_{jj'}=\frac{2}{k}\sum_{i=1}^{k/2}y_{ij}y_{ij'} } $$