Two definitions of covariance of a multivariate distribution seem to give different answers


The formal definition of the covariance between two random variables $X$ and $Y$ is $$\mathrm{cov}(X, Y)=\big\langle\big(X-\langle X\rangle\big)\big(Y-\langle Y\rangle\big)\big\rangle,$$ whereas the usual shortcut, easily proven to be equivalent, is $$\mathrm{cov}(X, Y)=\langle XY\rangle - \langle X\rangle\langle Y\rangle.$$ (Notational note: I use $\langle X\rangle$ to denote the expectation value of the random variable $X$.)

As an example, consider the joint probability density function given by $$f(x,y)=\mathrm{Pr}[X{\approx}x; Y{\approx}y]=\cases{{kx\textrm{ on the unit square}}\\{0\textrm{ otherwise},}}$$ where $k$ is a constant.

I now compute some expectation values:

$$\langle X\rangle=\int_0^1\int_0^1 x\, f(x,y)\,\mathrm{d}x\,\mathrm{d}y = \frac13 k.$$ $$\langle Y\rangle=\int_0^1\int_0^1 y\, f(x,y)\,\mathrm{d}x\,\mathrm{d}y = \frac14 k.$$ $$\langle XY\rangle=\int_0^1\int_0^1 xy\, f(x,y)\,\mathrm{d}x\,\mathrm{d}y = \frac16 k.$$

Thus, computing $\mathrm{cov}(X, Y)$ by the formal definition, I get $$\begin{align}\mathrm{cov}(X, Y)&=\big\langle\big(X-\langle X\rangle\big)\big(Y-\langle Y\rangle\big)\big\rangle\\&=\int_0^1\int_0^1\big(x-\tfrac13k\big)\big(y-\tfrac14k\big)\,f(x,y)\,\mathrm{d}x\,\mathrm{d}y\\&=\frac{k}{24}(k-2)^2.\end{align}$$

But using the shortcut definition, I instead get $$\mathrm{cov}(X, Y)=\langle XY\rangle - \langle X\rangle\langle Y\rangle = \frac16 k-\frac{1}{12} k^2.$$

Why do I get two different answers? Which solution is right, and what did I do wrong in the other one?

All of your integrations seem to be correct, but there are several difficulties here, all worth straightening out before they cause you trouble later.

First, because you are using integrals (not sums), you must intend $f$ to be a bivariate continuous density function on the unit square. In that case, you cannot correctly write $f(x,y) = P(X=x,\, Y=y)$: for a continuous distribution, $P(X=x,\, Y=y) = 0$ at every single point.

Second, in order for $f(x,y) = kx$ to integrate to $1$ over the unit square, you must have $k = 2.$ (That is what the Comment means by 'normalization'.) Also, it is clear that $X$ and $Y$ are independent, so the covariance is $0.$ 'Clearly independent' because $f_{X,Y}(x,y) = (2x)(1) = f_X(x)f_Y(y),$ for $0 < x < 1$ and $0 < y < 1.$ This means that $X \sim \mathrm{Beta}(2,1)$ and $Y \sim \mathrm{Unif}(0,1),$ independently.
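Concretely, the normalization condition pins down $k$: $$\int_0^1\!\!\int_0^1 kx\,\mathrm{d}x\,\mathrm{d}y = \frac{k}{2} = 1 \quad\Longrightarrow\quad k = 2.$$ This also resolves the apparent contradiction in the question: the proof that the shortcut formula equals the formal definition expands the product and uses $\langle 1\rangle = 1,$ which holds only when $f$ integrates to $1.$ For general $k$ one instead gets $\langle 1\rangle = k/2,$ and the expansion $\langle XY\rangle - 2\langle X\rangle\langle Y\rangle + \langle X\rangle\langle Y\rangle\langle 1\rangle = \frac{k}{6} - \frac{k^2}{6} + \frac{k^3}{24} = \frac{k}{24}(k-2)^2$ reproduces your 'formal' answer exactly. At $k = 2$ both of your expressions, $\frac{k}{24}(k-2)^2$ and $\frac{k}{6} - \frac{k^2}{12},$ vanish, so the two definitions agree, as they must.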

Third, with $k = 2$ you have $E(X) = 2/3,\,$ $E(Y) = 1/2,\,$ and $E(XY) = 1/3.$
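As a quick sanity check, here is a small Python sketch (the `moment` helper is mine, not part of the original computation): since the density is the monomial $2x,$ every joint moment factors into a product of one-dimensional integrals, $E(X^m Y^n) = 2\int_0^1 x^{m+1}\,\mathrm{d}x \int_0^1 y^n\,\mathrm{d}y = \frac{2}{(m+2)(n+1)}.$

```python
# Exact moments of f(x,y) = 2x on the unit square, computed with
# rational arithmetic so there is no floating-point fuzz.
from fractions import Fraction

def moment(m, n):
    """E[X^m Y^n] for the density f(x,y) = 2x on [0,1]^2."""
    return Fraction(2, (m + 2) * (n + 1))

EX, EY, EXY = moment(1, 0), moment(0, 1), moment(1, 1)
print(EX, EY, EXY)        # 2/3 1/2 1/3
print(EXY - EX * EY)      # 0 -> the covariance vanishes, as claimed
```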

Then, as in the Comment by @stochasticboy321, $E[(X - E(X))(Y - E(Y))] = 0$ and also $E(XY) - E(X)E(Y) = 0.$


Finally, I think this is not a particularly good example, because the covariance is $0.$ Perhaps a better example is to have the joint density function be defined as $k$ on the triangle with vertices at the origin, $(1,1),$ and $(0,1).$

There, $X$ and $Y$ are obviously not independent, because $P(X > .5) > 0$ and $P(Y < .5) > 0,$ while $P(X > .5,\, Y < .5) = 0.$

Maybe you can try that for practice. You can begin by showing that $k = 2$ (here also). Take care with the limits on the integrals, integrating over a triangle.

I simulated this example in R by generating a million points uniformly at random in the unit square, and then throwing away the ones for which $X > Y.$ The simulation gives the approximate quantities $E(X) \approx .33,$ $E(Y) \approx .67,$ and $Cov(X,Y) \approx .028$ (computed both ways). Also (not shown), $E(XY) \approx .25.$ You can use these approximate answers to check your exact ones, if you choose to try the extra problem.

 m = 10^6;  x = runif(m);  y = runif(m)
 acc = x < y;  x = x[acc];  y = y[acc]  # keep relevant points
 mean(x);  mean(y);  cov(x,y)
 ## 0.3330813
 ## 0.6665181
 ## 0.02760107
 mean(x*y) - mean(x)*mean(y)
 ## 0.02760101
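For readers without R, the same rejection-sampling check can be sketched in Python (my translation, not the code above; I use $10^5$ points rather than a million to keep it quick):

```python
# Sample the unit square uniformly and keep only the points with x < y,
# which are then uniform on the triangle with vertices (0,0), (1,1), (0,1).
import random

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(10**5)]
acc = [(x, y) for x, y in pts if x < y]   # keep relevant points

n = len(acc)
mean_x = sum(x for x, _ in acc) / n
mean_y = sum(y for _, y in acc) / n
mean_xy = sum(x * y for x, y in acc) / n
cov = mean_xy - mean_x * mean_y           # shortcut form of the covariance

# Should be close to 1/3, 2/3, and 1/36 ~ 0.0278 respectively.
print(round(mean_x, 3), round(mean_y, 3), round(cov, 3))
```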

A plot of the first 30,000 of the accepted points illustrates the uniform distribution over the indicated triangle.
