According to Wikipedia, the formula for combining the covariances of two sets is: $$C_X=C_A+C_B+(\overline{x}_A-\overline{x}_B)(\overline{y}_A-\overline{y}_B) \cdot\frac{n_An_B}{n_X} $$ where:
- $A$ and $B$ are the first and second sets.
- $C$ is the covariance.
- $n$ is the number of samples.
- $n_X = n_A + n_B$.
- $x$ and $y$ are the features.
To test this formula, I split one dataset into two equal halves and combined their covariances, yet the result is quite different from the covariance of the original dataset.
Now, let $M_{AB}$ be this part of the above formula:
$$(\overline{x}_A-\overline{x}_B)(\overline{y}_A-\overline{y}_B) \cdot\frac{n_An_B}{n_X}$$
Looking at this implementation, the author basically applied the following formula:
$$ C_X = \frac{(C_A \color{red}{\cdot n_A}) + (C_B \color{red}{\cdot n_B}) + M_{AB}}{\color{red}{n_X}} $$
which gives the correct combined covariance!
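The discrepancy can be reproduced with a minimal sketch like the one below. It assumes population covariance (dividing by $n$, not $n-1$) and a small toy dataset; the function names `pop_cov`, `mean_shift_term`, `combine_unscaled` and `combine_scaled` are hypothetical and are not taken from Wikipedia or from the linked implementation.

```python
import numpy as np

def pop_cov(x, y):
    # Population covariance (assumption: divide by n, not n - 1).
    return float(np.mean((x - x.mean()) * (y - y.mean())))

def mean_shift_term(xa, ya, xb, yb):
    # M_AB = (mean(x_A) - mean(x_B)) * (mean(y_A) - mean(y_B)) * n_A * n_B / n_X
    na, nb = len(xa), len(xb)
    return float((xa.mean() - xb.mean()) * (ya.mean() - yb.mean()) * na * nb / (na + nb))

def combine_unscaled(xa, ya, xb, yb):
    # First formula, with C read literally as the covariance.
    return pop_cov(xa, ya) + pop_cov(xb, yb) + mean_shift_term(xa, ya, xb, yb)

def combine_scaled(xa, ya, xb, yb):
    # Second formula: weight each covariance by its sample count, divide by n_X.
    na, nb = len(xa), len(xb)
    m_ab = mean_shift_term(xa, ya, xb, yb)
    return (pop_cov(xa, ya) * na + pop_cov(xb, yb) * nb + m_ab) / (na + nb)

# Toy data (hypothetical): x = 0..9 and y = x^2, split into two equal halves.
x = np.arange(10, dtype=float)
y = x ** 2
xa, ya, xb, yb = x[:5], y[:5], x[5:], y[5:]

full = pop_cov(x, y)                         # 74.25
unscaled = combine_unscaled(xa, ya, xb, yb)  # 598.5
scaled = combine_scaled(xa, ya, xb, yb)      # 74.25

print(full, unscaled, scaled)
```

On this toy data, only the second formula reproduces the covariance of the full dataset.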
I cannot see how the latter is derived algebraically, or whether it is even equivalent to the former, because the second formula contains the extra factors $\color{red}{n_A}$ and $\color{red}{n_B}$ and the division by $\color{red}{n_X}$.
Your help is appreciated.