I face the following problem: I have a total of $c$ samples $S^{c\times r}$ with $r$ features each. These are split at a position $p\in\{1,\dots,c\}$ into two subsets $S_{left}^{p\times r}$ and $S_{right}^{(c-p)\times r}$. For both of these subsets, I have to calculate the determinant of the co-variance matrix $\Sigma$. This I achieve through the following steps (sketched in code after the list):
- Computing the mean feature vector $\mathbf{\mu}_{left}$ over all samples in $S_{left}$, with $|\mathbf{\mu}_{left}| = r$
- Centering the matrix: $S_{left,c} = S_{left} - \mathbf{\mu}_{left}$, i.e. subtracting $\mathbf{\mu}_{left}$ from every row
- Computing the co-variance matrix with $\Sigma(S_{left,c}) = \frac{1}{p-1}S_{left,c}^\top S_{left,c}$
- Computing the determinant $\det(\Sigma(S_{left,c}))$
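For reference, a minimal NumPy sketch of this naive per-split computation (the function name `det_of_cov` is my own, not from any library):

```python
import numpy as np

def det_of_cov(S):
    """Determinant of the sample co-variance matrix of S (rows are samples)."""
    mu = S.mean(axis=0)                  # mean feature vector, O(c*r)
    S_c = S - mu                         # centered matrix (mean subtracted per row)
    sigma = S_c.T @ S_c / (len(S) - 1)   # co-variance matrix, O(c*r^2)
    return np.linalg.det(sigma)          # determinant, O(r^3)

# split at position p and evaluate both subsets
# det_left, det_right = det_of_cov(S[:p]), det_of_cov(S[p:])
```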
So far, so good. Now the split position $p$ changes over time (monotonically increasing), and samples have to be moved from the right to the left subset. My question: is there any way to update $\Sigma$ with minimal computational cost?
The problem is that in my case $c \gg r$, and the above approach depends strongly on the size of $c$. I've figured out how to update $\mathbf{\mu}$ in $\mathcal{O}(r)$. Equally, I can update $\Sigma$ in $\mathcal{O}(r^2)$, but only under the assumption of a static $\mathbf{\mu}$. How can I obtain at least an estimate of $\det(\Sigma)$ after adding/removing a sample?
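For concreteness, an $\mathcal{O}(r)$ mean update of this kind is the standard incremental form (adding, respectively removing, a sample):

$$ \mathbf{\mu}_{n+1} = \mathbf{\mu}_n + \frac{\mathbf{x}_{n+1} - \mathbf{\mu}_n}{n+1}, \qquad \mathbf{\mu}_{n-1} = \mathbf{\mu}_n + \frac{\mathbf{\mu}_n - \mathbf{x}_n}{n-1} $$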
Additional questions:
- Can I use the Cholesky decomposition to compute the determinant? The Wikipedia entry on determinants states that $\Sigma$ would have to be positive definite for this, but as far as I understand, it is only guaranteed to be positive semi-definite. I ask because there has been an interesting answer to a similar problem that, frankly, I do not get, especially where Sylvester's determinant theorem fits in. (A sketch of the Cholesky route follows this list.)
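For illustration, a sketch of the Cholesky route under the assumption that $\Sigma$ is (made) positive definite, e.g. by adding a tiny diagonal jitter; `logdet_via_cholesky` and `jitter` are names of my own invention:

```python
import numpy as np

def logdet_via_cholesky(sigma, jitter=1e-10):
    """log|sigma| via Cholesky; requires positive definiteness.

    With more than r independent samples, the sample co-variance matrix
    is positive definite almost surely; the jitter only guards against
    numerical semi-definiteness.
    """
    L = np.linalg.cholesky(sigma + jitter * np.eye(len(sigma)))
    return 2.0 * np.sum(np.log(np.diag(L)))  # |sigma| = prod(diag(L))^2
```

(NumPy's `np.linalg.slogdet` is a robust alternative that does not require positive definiteness.)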
Solution
I've found a solution through pen-and-paper work and tested it.
Formulation
Given a sample of $n$ independent observations $\mathbf{x}_1,\dots,\mathbf{x}_n$ of length $m$, the sample co-variance matrix of $X\in \mathbb{R}^{n\times m}$ is given by
$$ Q = \frac{1}{n-1} \sum_{i=1}^n(\mathbf{x}_i - \mathbf{\hat{x}})(\mathbf{x}_i - \mathbf{\hat{x}})^T $$
where $\mathbf{x}_i$ denotes the $i$-th observation and
$$ \mathbf{\hat{x}} = \frac{1}{n}\sum_{i=1}^n\mathbf{x}_i $$ is the sample mean. Re-organizing the first formula, we get
$$ Q = \frac{1}{n-1}\left[\sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i^T - \mathbf{\hat{x}}\left(\sum_{i=1}^n\mathbf{x}_i\right)^T - \left(\sum_{i=1}^n\mathbf{x}_i\right)\mathbf{\hat{x}}^T + n\mathbf{\hat{x}}\mathbf{\hat{x}}^T\right] $$ which is essentially an application of the binomial expansion $(a-b)^2 = a^2 - 2ab + b^2$, with the two cross terms kept separate because the outer products do not commute. Substituting $A = \sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i^T$ and $\mathbf{b} = \sum_{i=1}^n\mathbf{x}_i$, we get $$ Q = \frac{1}{n-1}\left[A - \mathbf{\hat{x}}\mathbf{b}^T - \mathbf{b}\mathbf{\hat{x}}^T + n\mathbf{\hat{x}}\mathbf{\hat{x}}^T\right] $$
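A quick NumPy check that this reorganized form agrees with the textbook definition (a throwaway sketch; the random data is only for verification):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # n = 100 observations of length m = 5
n = len(X)

A = X.T @ X                         # sum_i x_i x_i^T
b = X.sum(axis=0)                   # sum_i x_i
xhat = b / n                        # sample mean

Q = (A - np.outer(xhat, b) - np.outer(b, xhat)
     + n * np.outer(xhat, xhat)) / (n - 1)
assert np.allclose(Q, np.cov(X, rowvar=False))  # matches numpy's co-variance
```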
Updating
From this form, we can derive an efficient update of $Q_{n+1}$ when a new sample $\mathbf{x}_{n+1}$ becomes available. First we update $A$ and $\mathbf{b}$ $$ A_{n+1} = A_n + \mathbf{x}_{n+1}\mathbf{x}_{n+1}^T\\ \mathbf{b}_{n+1} = \mathbf{b}_n + \mathbf{x}_{n+1}\\ \mathbf{\hat{x}}_{n+1} = \frac{1}{n+1}\mathbf{b}_{n+1} $$ to finally compute the updated co-variance matrix $$ Q_{n+1} = \frac{1}{n}\left[A_{n+1} - \mathbf{\hat{x}}_{n+1}\mathbf{b}_{n+1}^T - \mathbf{b}_{n+1}\mathbf{\hat{x}}_{n+1}^T + (n+1)\mathbf{\hat{x}}_{n+1}\mathbf{\hat{x}}_{n+1}^T\right] $$ Every step costs only $\mathcal{O}(m^2)$. Removing a sample works analogously, with the additions replaced by subtractions.
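A minimal NumPy sketch of this scheme (class and method names are my own and do not reflect the API of the linked repository):

```python
import numpy as np

class DynCov:
    """Incrementally updatable sample co-variance matrix."""

    def __init__(self, X):
        X = np.asarray(X, dtype=float)
        self.n = len(X)
        self.A = X.T @ X          # running sum of outer products
        self.b = X.sum(axis=0)    # running sum of samples

    def add(self, x):
        self.A += np.outer(x, x)  # A_{n+1} = A_n + x x^T, O(m^2)
        self.b += x               # b_{n+1} = b_n + x
        self.n += 1

    def remove(self, x):
        # symmetric down-date; covers the right subset shrinking as p grows
        self.A -= np.outer(x, x)
        self.b -= x
        self.n -= 1

    def cov(self):
        xhat = self.b / self.n    # current sample mean
        return (self.A - np.outer(xhat, self.b) - np.outer(self.b, xhat)
                + self.n * np.outer(xhat, xhat)) / (self.n - 1)
```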
Speed
The update costs $\mathcal{O}(m^2)$ per sample, compared to $\mathcal{O}(nm^2)$ for recomputing the complete sample co-variance matrix, and is therefore considerably cheaper, not only when $n \gg m$.
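An illustrative micro-benchmark, reusing the `DynCov` sketch from above (absolute numbers will of course vary with hardware and matrix sizes):

```python
import timeit
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100_000, 10))  # n >> m
dc = DynCov(X)
x_new = rng.standard_normal(10)

full = timeit.timeit(lambda: np.cov(X, rowvar=False), number=10)
incr = timeit.timeit(lambda: (dc.add(x_new), dc.cov()), number=10)
print(f"full recompute: {full:.4f} s, incremental: {incr:.4f} s")
```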
Relation to Kalman Filter
I am not sure whether this solution has any relation to the Kalman filter. But since I do not use prediction, innovation, or the Kalman gain, I do not think so. Maybe someone can correct me or confirm this observation?
Implementation
https://github.com/loli/dynstatcov