How can I prove that $$ \frac 1 {n-1} \sum_{i=1}^n (X_i - \bar X)(Y_i-\bar Y) $$ is an unbiased estimate of the covariance $\operatorname{Cov}(X, Y)$ where $\bar X = \dfrac 1 n \sum_{i=1}^n X_i$ and $\bar Y = \dfrac 1 n \sum_{i=1}^n Y_i$ and $(X_1, Y_1), \ldots ,(X_n, Y_n)$ an independent sample from random vector $(X, Y)$?
unbiased estimate of the covariance
Asked by Bumbble Comm (https://math.techqa.club/user/bumbble-comm/detail). 26k views. There are 3 solutions below.
Additional Comment, after some thought, following an exchange of Comments with @MichaelHardy:
His answer closely parallels the usual demonstration that $E(S^2) = \sigma^2$ and is easy to follow. However, the proof below, in abbreviated notation that I hope is not too cryptic, may be more direct.
$$(n-1)S_{xy} = \sum(X_i-\bar X)(Y_i - \bar Y) = \sum X_i Y_i -n\bar X \bar Y = \sum X_i Y_i - \frac{1}{n}\sum X_i \sum Y_i.$$
Hence,
$$(n-1)E(S_{xy}) = E\left(\sum X_i Y_i\right) - \frac{1}{n}E\left(\sum X_i \sum Y_i\right)\\ = n\mu_{xy} - \frac{1}{n}[n\mu_{xy} + n(n-1)\mu_x \mu_y]\\ = (n-1)[\mu_{xy}-\mu_x\mu_y] = (n-1)\sigma_{xy},$$
so the expectation of the sample covariance $S_{xy}$ is the population covariance $\sigma_{xy} = \operatorname{Cov}(X,Y),$ as claimed.
Note that $\operatorname{E}\left(\sum X_i \sum Y_i\right)$ expands into $n^2$ terms: the $n$ terms with $i=j$ each contribute $\operatorname{E}(X_iY_i) = \mu_{xy},$ and the $n(n-1)$ terms with $i\neq j$ each contribute $\operatorname{E}(X_iY_j) = \mu_x\mu_y$ by independence.
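A quick simulation makes the $1/(n-1)$ factor concrete. This is a Python sketch under an assumed toy model of my own choosing ($X\sim N(0,1)$, $Y = 0.5X + \text{noise}$, so $\operatorname{Cov}(X,Y)=0.5$); it averages $S_{xy}$ over many small samples and also shows that the $1/n$ version is biased by the factor $(n-1)/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000

# Assumed toy model (not from the answer): X ~ N(0,1), Y = 0.5*X + noise,
# so the true covariance is Cov(X, Y) = 0.5.
x = rng.standard_normal((trials, n))
y = 0.5 * x + rng.standard_normal((trials, n))

# Sample covariance with the 1/(n-1) factor, one value per trial.
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
s_xy = (xc * yc).sum(axis=1) / (n - 1)

print(s_xy.mean())                         # close to 0.5
print(((xc * yc).sum(axis=1) / n).mean())  # 1/n version: close to (n-1)/n * 0.5 = 0.4
```

Even with samples as small as $n=5$, the average of $S_{xy}$ sits on the true covariance, while the $1/n$ estimator falls short by exactly the predicted factor.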
I wanted to add an answer to Jason Kang's question in the comments. We know that $x_i$ and $y_j$ have zero covariance whenever $i \neq j$.
Look at it this way: suppose $X$ is drawn from a normal distribution and $Y = 3X$. We have perfect correlation, right? Yes, but only between the matched pairs $x_i, y_i$, not between $x_i$ and $y_j$ for $i \neq j$: $y_j = 3 x_j$, but in general $y_j \neq 3 x_i$. If I jumble the order of the $y$'s, I will get $0$ correlation.
One can see this from viewing a plot:
> library(data.table)
> set.seed(1)  # for reproducibility
> tabl <- data.table(x = rnorm(10000))
> tabl[, y := x * 3]  # each y_i is matched to its own x_i
> plot(tabl$y ~ tabl$x, xlab = "x_i", ylab = "y_i")  # a perfect straight line
> plot(sample(tabl$y, 10000) ~ tabl$x)  # shuffled y's: no visible relationship
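The same point can be checked numerically rather than visually. This Python sketch (my own illustration, mirroring the R example above) computes the correlation for matched pairs and for a shuffled copy of $y$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = 3 * x  # y_i = 3 * x_i: perfectly correlated with the matching x_i

corr_matched = np.corrcoef(x, y)[0, 1]                    # exactly 1 (up to rounding)
corr_shuffled = np.corrcoef(x, rng.permutation(y))[0, 1]  # near 0

print(corr_matched, corr_shuffled)
```

Shuffling destroys the pairing, so the sample correlation collapses to roughly $\pm 1/\sqrt{10000} = \pm 0.01$, i.e. noise around zero.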


Let $\mu=\operatorname{E}(X)$ and $\nu = \operatorname{E}(Y).$ Then \begin{align} & \sum_{i=1}^n (X_i - \bar X)(Y_i-\bar Y) \\[10pt] = {} & \sum_{i=1}^n \Big( (X_i - \mu) + (\mu - \bar X)\Big) \Big((Y_i - \nu) + (\nu - \bar Y)\Big) \\[10pt] = {} & \left( \sum_i (X_i-\mu)(Y_i-\nu) \right) + \left( \sum_i (X_i-\mu)(\nu - \bar Y) \right) \\ & {} +\left( \sum_i (\mu-\bar X)(Y_i - \nu) \right) + \left( \sum_i(\mu-\bar X)(\nu - \bar Y) \right). \end{align}
The expected value of the first of the four terms above is $$ \sum_{i=1}^n \operatorname{E}\big( (X_i-\mu)(Y_i-\nu) \big) = \sum_{i=1}^n \operatorname{cov}(X_i,Y_i) = n\operatorname{cov}(X,Y). $$ The expected value of the second term is \begin{align} & \sum_i -\operatorname{cov}(X_i, \bar Y) = \sum_i - \operatorname{cov}\left(X_i, \frac {Y_1+\cdots+Y_n} n \right) \\[10pt] = {} & -n\operatorname{cov}\left( X_1, \frac{Y_1+\cdots+Y_n} n \right) = - \operatorname{cov}(X_1, Y_1+\cdots +Y_n) \\[10pt] = {} & -\operatorname{cov}(X_1,Y_1) + 0 + \cdots + 0 = -\operatorname{cov}(X,Y). \end{align} The third term is similarly that same number.
The fourth term is \begin{align} & \sum_i \overbrace{\operatorname{cov}(\bar X,\bar Y)}^{\text{No “} i \text{'' appears here.}} = n \operatorname{cov}(\bar X, \bar Y) = n \operatorname{cov}\left( \frac 1 n \sum_i X_i, \frac 1 n \sum_i Y_i \right) \\[10pt] = {} & n \cdot \frac 1 {n^2} \Big( \, \underbrace{\cdots + \operatorname{cov}(X_i, Y_j) + \cdots}_{n^2\text{ terms}} \, \Big). \end{align} This last sum is over all pairs of indices $i$ and $j$. But the covariances are $0$ except the ones in which $i=j$. Hence there are just $n$ nonzero terms, and we have $$ n\cdot \frac 1 {n^2} \left( \sum_i \operatorname{cov} (X_i,Y_i) \right) = n\cdot \frac 1 {n^2} \cdot n \operatorname{cov}(X,Y) = \operatorname{cov}(X,Y). $$
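The claim behind the fourth term, $\operatorname{cov}(\bar X, \bar Y) = \operatorname{cov}(X,Y)/n,$ is easy to verify by simulation. This Python sketch uses an assumed toy model of my own ($Y = 0.5X + \text{noise}$, so $\operatorname{cov}(X,Y) = 0.5$) and estimates the covariance of the two sample means over many replications:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200_000

# Assumed toy model: Cov(X, Y) = 0.5, so Cov(Xbar, Ybar) should be 0.5 / n = 0.05.
x = rng.standard_normal((trials, n))
y = 0.5 * x + rng.standard_normal((trials, n))

xbar, ybar = x.mean(axis=1), y.mean(axis=1)
emp_cov = np.mean(xbar * ybar) - xbar.mean() * ybar.mean()
print(emp_cov)  # close to 0.05
```

The empirical covariance of the means lands on $\operatorname{cov}(X,Y)/n,$ matching the $n \cdot \frac{1}{n^2} \cdot n$ bookkeeping above.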
Combining the four terms, $$ \operatorname{E}\left( \sum_{i=1}^n (X_i - \bar X)(Y_i-\bar Y) \right) = n\operatorname{cov}(X,Y) - \operatorname{cov}(X,Y) - \operatorname{cov}(X,Y) + \operatorname{cov}(X,Y) = (n-1)\operatorname{cov}(X,Y), $$ so dividing by $n-1$ gives an unbiased estimator.