How come the covariance matrix encodes both rotation parameters and the spread of the data? I observed that the covariance matrix of an $N$-dimensional dataset has the following number of degrees of freedom (i.e. unique numbers): $N$ + [the number of degrees of freedom needed to describe a rotation in $N$ dimensions].
But how come simply averaging products of the data points' coordinates (which is how the covariance matrix is derived, and I'm comfortable with this derivation) yields those parameters?
The numbers on the diagonal seem to encode the variance in the direction of each basis vector, and the remaining numbers above the diagonal seem to encode rotation (at least, the number of parameters above the diagonal is the same as the number of degrees of freedom needed to describe a rotation in $N$ dimensions).
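To spell out the count I have in mind (a sketch of my own bookkeeping, assuming a rotation in $N$ dimensions means an element of $SO(N)$, which has $\frac{N(N-1)}{2}$ degrees of freedom):

$$
\underbrace{\frac{N(N+1)}{2}}_{\text{unique entries of a symmetric } N \times N \text{ matrix}}
= \underbrace{N}_{\text{diagonal entries}}
+ \underbrace{\frac{N(N-1)}{2}}_{\text{entries above the diagonal}}
$$

For $N = 2$ this gives $3 = 2 + 1$, and for $N = 3$ it gives $6 = 3 + 3$.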
What is the rotation parametrization used by the covariance matrix called (I mean all the covariance parameters, without the variance parameters)? For example, I don't expect it to be Euler angles… How can we convert between the rotation parametrization used by the covariance matrix and other rotation parametrizations (e.g. quaternions in 3D)? Is it even possible?
For example, let's consider the simple case of a covariance matrix in 2 dimensions. There are 3 degrees of freedom (DoF):
- 2 DoF for the variance in the $X$ and $Y$ axes of the original coordinate system in which the data points are expressed (as opposed to the variance in the "ideal" coordinate system defined by the eigenvectors of the covariance matrix)
- and 1 more parameter (the covariance between the $X$ and $Y$ coordinates). It seems as if the covariance parameter above the diagonal somehow encodes rotation.
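To make the 2D count concrete, here is a minimal sketch in Python (assuming NumPy; the function name `covariance_from_axes` and the sample values are only my own illustration). It builds a covariance matrix from two variances along the principal axes plus one rotation angle, so three numbers go in and three unique entries come out.

```python
import numpy as np

def covariance_from_axes(var_major, var_minor, theta):
    """Build a 2x2 covariance matrix as R @ diag(variances) @ R.T.

    var_major, var_minor: variances along the two principal axes.
    theta: rotation angle (radians) of the principal axes
           relative to the original X/Y coordinate system.
    """
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s],
                         [s,  c]])
    return rotation @ np.diag([var_major, var_minor]) @ rotation.T

# Three input parameters: two variances and one angle (illustrative values).
sigma = covariance_from_axes(4.0, 1.0, np.pi / 6)

# The result is symmetric, so it has only three unique entries:
# sigma[0, 0], sigma[1, 1] and sigma[0, 1] (== sigma[1, 0]).
print(sigma)

# The eigenvalues give back the variances along the principal axes
# (eigh returns them in ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(sigma)
print(eigenvalues)
```

Note that the diagonal entries here are the variances along the original $X$ and $Y$ axes, not the two variances that were passed in; those only reappear as the eigenvalues.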
How do(es) the covariance parameter(s) encode rotation? I would understand if the covariance were defined as $\frac{y}{x}$ (where $x$ and $y$ are the coordinates of a data point), which would be the tangent of the angle of the vector $[x, y]$, but the covariance is defined in terms of $xy$. Is $xy$ a hidden trig identity that can also encode an angle? And how does this scale to higher dimensions, where there are many such products above the diagonal of the covariance matrix?
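As a tentative numerical check of the "hidden trig identity" idea in 2D only: if I assume the covariance matrix has the form $R(\theta)\,\mathrm{diag}(\lambda_1, \lambda_2)\,R(\theta)^T$ with $\lambda_1 > \lambda_2$, then multiplying it out gives an off-diagonal entry of $\frac{1}{2}(\lambda_1 - \lambda_2)\sin 2\theta$ and a diagonal difference of $(\lambda_1 - \lambda_2)\cos 2\theta$. The sketch below (the function name `principal_angle` and the sample values are hypothetical, chosen for illustration) recovers the angle from the three entries under that assumption.

```python
import math

def principal_angle(var_x, var_y, cov_xy):
    """Angle (radians) of the major principal axis of a 2x2 covariance matrix.

    Assumes the matrix is R(theta) @ diag(l1, l2) @ R(theta).T with l1 > l2,
    so that 2*cov_xy = (l1 - l2)*sin(2*theta)
    and var_x - var_y = (l1 - l2)*cos(2*theta).
    """
    return 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)

# Illustrative values: variances along the principal axes and a 30 degree tilt.
theta = math.radians(30.0)
var_major, var_minor = 4.0, 1.0

# Entries of the covariance matrix in the original X/Y coordinates.
c, s = math.cos(theta), math.sin(theta)
var_x = var_major * c * c + var_minor * s * s
var_y = var_major * s * s + var_minor * c * c
cov_xy = (var_major - var_minor) * s * c  # = 0.5*(l1 - l2)*sin(2*theta)

recovered = principal_angle(var_x, var_y, cov_xy)
print(math.degrees(recovered))
```

So in 2D the covariance entry alone does not seem to be the angle; it looks like the angle only comes out of the combination of all three entries. I don't see how this generalizes when there are several off-diagonal products.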