I was going through the definition and meaning of variance and covariance. The resources I have give only the definition and the formula, without any insight.
For variance, I wrote down the formula and asked myself what it tells me. I figured out that variance has something to do with the mean of the spread: the term $(x_i-\bar x)^2$ in the formula is the square of the distance between the $i^{th}$ observation and the mean.
Now moving further, if we get data on two attributes, we can plot a scatter diagram. I compared this situation with the notion of centre of mass in 2D: the mean of the scatter plot will be $(\bar x,\bar y)$. Covariance extends the idea of variance to higher dimensions (this was stated in the book).
$$\operatorname{Cov}(X,Y)=\mathbb{E}[(x_i-\bar x)(y_i-\bar y)]$$
Above is the formula for covariance. I could not understand why this particular formula was chosen.
Second, if we have a point $(x_i,y_i)$ in the scatter plot and our mean is $(\bar x,\bar y)$, then the square of the distance between them is $(x_i-\bar x)^2+(y_i-\bar y)^2$. So I thought that if we are generalising the concept of variance to two dimensions, then the formula should be:
$$\operatorname{Cov}(X,Y)=\mathbb{E}[(x_i-\bar x)^2+(y_i-\bar y)^2]=\operatorname{Var}(X)+\operatorname{Var}(Y)$$
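To convince myself of the last equality, I checked it numerically on some made-up data points (the specific numbers are just for illustration): the mean squared distance to $(\bar x,\bar y)$ really does split into the sum of the two variances.

```python
# Made-up sample data for illustration only.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 2.0]

mx, my = sum(xs) / len(xs), sum(ys) / len(ys)

# Mean of the squared distance from each point to the mean point.
mean_sq_dist = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Population variances of each coordinate separately.
var_x = sum((x - mx) ** 2 for x in xs) / len(xs)
var_y = sum((y - my) ** 2 for y in ys) / len(ys)

print(mean_sq_dist, var_x + var_y)  # the two values agree
```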
Summary of my problem
In the formula for variance we use the square of the distance between the $i^{th}$ observation and the mean, and covariance is supposed to be variance generalised to two or more dimensions. So why are we not using the square of the distance between the $i^{th}$ observation and the mean? Why are we using something else, and in particular, why that formula?
Please help; I am struggling to digest this.
Covariance of two random variables $X$ and $Y$ is supposed to measure how they covary.
For instance, take two independent dice rolls, and let the result of the first roll be $X$ and the sum of both rolls be $Y$. If $X$ is higher than expected, then $Y$ is probably also higher than expected. If $X$ is lower than expected, then $Y$ is probably also lower than expected. We want this to lead to a positive covariance.
For another example, we again consider two independent dice rolls; $X$ is still the first roll, but $Y$ is now the second roll minus the first roll. This time, if $X$ rolls higher than expected, $Y$ is probably lower than expected, and if $X$ is lower than expected, then $Y$ is probably higher than expected. $X$ and $Y$ vary in opposite directions, so to speak. We want this to lead to a negative covariance. Positive covariance means that the two random variables vary in the same direction, while negative covariance means that they vary in opposite directions. For this, covariance must allow for negative values, which your proposed formula cannot produce, since it is a sum of squares and hence always nonnegative. Basically, your formula measures how much the two variables vary individually, not how they covary.
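Enumerating the same 36 outcomes for this second example shows both points at once: the covariance comes out negative, while your squared-distance formula gives a positive number in both examples and so cannot tell the two situations apart (again a sketch, with `cov` spelled out from the definition):

```python
from itertools import product

# All 36 equally likely outcomes of two independent dice rolls.
outcomes = list(product(range(1, 7), repeat=2))

# X = first roll, Y = second roll minus first roll.
xs = [a for a, b in outcomes]
ys = [b - a for a, b in outcomes]

def cov(u, v):
    """Population covariance: mean of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

print(cov(xs, ys))  # negative: -35/12, about -2.9167

# Your proposed formula equals Var(X) + Var(Y), which is positive here
# just as it was in the first example -- it is blind to the direction.
var_sum = cov(xs, xs) + cov(ys, ys)
print(var_sum)
```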
Now for why this particular formula was chosen, there are multiple reasons. For one, it'd be nice if the covariance of a random variable with itself were just its variance: it covaries with itself exactly the way it varies in general. The usual definition accomplishes this: $\operatorname{Cov}(X,X)=\operatorname{Var}(X)$. Another reason is that the usual definition has some pretty cool algebraic properties. It's linear in both arguments. It's also symmetric and positive definite, in the sense that only almost surely constant random variables have zero covariance with themselves. Taken together, this means that covariance is an inner product on suitable spaces of random variables, with all the cool results that brings, like the Cauchy-Schwarz inequality.
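These properties are easy to check numerically. The following sketch uses simulated samples (the distributions and coefficients are arbitrary choices for illustration) and verifies that the sample covariance recovers the variance, is linear in its first argument, and satisfies Cauchy-Schwarz:

```python
import random

random.seed(0)

def cov(u, v):
    """Sample covariance: mean of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 1) for _ in range(n)]
zs = [random.gauss(0, 1) for _ in range(n)]

# Cov(X, X) is just Var(X), so it is positive for non-constant data.
var_x = cov(xs, xs)
print(var_x > 0)  # True

# Linearity in the first argument: Cov(2X + 3Y, Z) = 2 Cov(X, Z) + 3 Cov(Y, Z).
lhs = cov([2 * x + 3 * y for x, y in zip(xs, ys)], zs)
rhs = 2 * cov(xs, zs) + 3 * cov(ys, zs)
print(abs(lhs - rhs))  # essentially zero (floating-point error only)

# Cauchy-Schwarz: Cov(X, Y)^2 <= Var(X) * Var(Y).
print(cov(xs, ys) ** 2 <= cov(xs, xs) * cov(ys, ys))  # True
```

Both the linearity and the Cauchy-Schwarz check hold exactly for sample covariance, not just approximately, because sample covariance is itself an inner product on the centered data vectors.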