Random variables and order of arguments in joint distributions


I have a question about exchangeable random variables. I have always assumed that, given random variables $X$ and $Y$, the two joint distributions are equal, $p(X,Y)=p(Y,X)$ (don't we need this to get Bayes' theorem?), but after learning the definition of exchangeable random variables, I realized this holds only if $X$ and $Y$ are exchangeable.

But then, how do we get Bayes' theorem in the general case? Don't we define $p(Y|X)=p(Y,X)/p(X)$ and $p(X|Y)=p(X,Y)/p(Y)$?

Edit: I can understand the case where we interpret $(X,Y)$ as events or sets $(X=x,Y=y)$, as the answers below explain. But I am still confused about the general case, namely where we have the joint distribution $P_{X,Y}(dx,dy)=P^x_Y(dy)P_X(dx)$, where $P^x_Y$ is the conditional distribution of $Y$ given $X=x$. When we apply Bayes' theorem, don't we need $P_{X,Y}(dx,dy)=P_{Y,X}(dy,dx)$? After all, exchanging $X$ and $Y$ above gives $P_{Y,X}(dy,dx)=P_X^y(dx)P_Y(dy)$, and $P_Y^x(dy)P_X(dx)=P_X^y(dx)P_Y(dy)$ is Bayes' theorem. So what goes wrong in the case of general joint distributions, as opposed to equalities of set probabilities? This product rule for the joint distribution is from the general case in https://en.wikipedia.org/wiki/Bayes%27_theorem.

And in the case of continuous random variables, why is the numerator for both $f_{X|Y=y}(x)$ and $f_{Y|X=x}(y)$ the same $f_{X,Y}(x,y)$, the joint density of $(X,Y)$? Shouldn't the second be $f_{Y,X}(y,x)$, the joint density of $(Y,X)$? I can't see why $f_{X,Y}(x,y)=f_{Y,X}(y,x)$, since this is not a set intersection: one is derived by differentiating the joint distribution function $F_{X,Y}$ of $(X,Y)$ and the other $F_{Y,X}$ of $(Y,X)$. So my question is mainly why we are able to ignore the order of the random variables when we consider joint distributions and densities, even when the variables are not exchangeable.


Accepted answer (score 12)

I think this is more of a notational confusion than a conceptual one. When we write $P(A, B)$, we usually mean that event $A$ and event $B$ both happen, so in this sense $P(A, B) = P(B, A) = P(A\cap B)$. Note that $P$ is not a function of an ordered pair here, just a convention for talking about probabilities of events.

On the other hand, if we are talking about functions, the meaning is completely different: in that case $P(A, B)$ means the first argument is $A$ and the second argument is $B$. To better understand exchangeability, consider symmetric functions. For example, $f(x,y) = x^2 + y^2$ is symmetric, so $f(x,y) = f(y,x)$, i.e., you can swap the first and second arguments. But this is not true in general, of course. Exchangeability is the analogous idea in probability theory.

Update: I think I better understand where you got confused (and I must admit that it's more subtle than it looks). I will use the notation you used to clarify things.

First of all, suppose that $X$ and $Y$ both take values in $[0,1]$, so that the support of the joint distribution doesn't complicate the discussion.

What you are saying is this: $$f_{X,Y}(X=a, Y=b) = f_{Y,X}(Y=b, X=a),$$ which is always true, because this identity follows from a simple change of variables (coordinates): the LHS expresses the joint density in the $X$-$Y$ plane and the RHS in the $Y$-$X$ plane. Therefore $f_{X,Y}(X=a, Y=b)$ and $f_{Y,X}(Y=b, X=a)$ live in different coordinate systems, and of course the joint density when $X=a$ and $Y=b$ must be the same regardless of the coordinate system, so they are equal.

However, exchangeability is a different concept. Its definition in this notation would be $$f_{X,Y}(X=a, Y=b) = f_{X,Y}(X=b, Y=a), \quad \forall (a,b)\in [0,1]\times [0,1].$$ So we require the joint density function to be symmetric with respect to the line $y=x$ in the same $X$-$Y$ coordinate system.

If you want a concrete example, take $f_{X,Y}(X=a, Y=b)= 6ab^2$ on $[0,1]^2$. Then of course $f_{Y,X}(Y=b, X=a) = 6ab^2$, but $$f_{X,Y}(X=a, Y=b) = 6ab^2 \neq 6ba^2 = f_{X,Y}(X=b, Y=a) \quad \text{for } a \neq b,$$ so $X$ and $Y$ are not exchangeable.
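A quick numerical sketch of this example (assuming a plain midpoint Riemann sum is accurate enough here): check that $f(a,b)=6ab^2$ integrates to $1$ over $[0,1]^2$, so it is a proper joint density, yet is not symmetric in its arguments.

```python
def f(a, b):
    # The answer's example density on [0,1]^2.
    return 6 * a * b**2

# Midpoint Riemann sum of f over [0,1]^2.
n = 200
h = 1.0 / n
total = sum(f((i + 0.5) * h, (j + 0.5) * h)
            for i in range(n) for j in range(n)) * h * h
print(round(total, 3))   # close to 1.0: f is a valid density

# Asymmetry at one (arbitrary) point: f(a, b) != f(b, a).
print(f(0.3, 0.7), f(0.7, 0.3))   # 0.882 vs 0.378 (up to rounding)
```

Since the density is not symmetric in $(a,b)$, the pair fails the exchangeability definition above.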

On the other hand, the standard bivariate Gaussian distribution is exchangeable: $$ f_{X,Y}(X=a, Y=b) = \frac{1}{2\pi \sqrt{1-\rho^2}}\exp\left[-\frac{a^2 - 2\rho ab + b^2}{2(1-\rho^2)}\right] $$ $$ f_{X,Y}(X=b, Y=a) = \frac{1}{2\pi \sqrt{1-\rho^2}}\exp\left[-\frac{b^2 - 2\rho ba + a^2}{2(1-\rho^2)}\right] $$ $$ \implies f_{X,Y}(X=a, Y=b) = f_{X,Y}(X=b, Y=a). $$
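The same check can be done numerically (here $\rho = 0.5$ and the evaluation point are arbitrary choices, not from the answer): evaluating the standard bivariate normal density at $(a,b)$ and at $(b,a)$ gives the same value.

```python
import math

def std_bivariate_normal(a, b, rho):
    # Standard bivariate normal density with correlation rho,
    # as written in the answer above.
    norm = 1.0 / (2 * math.pi * math.sqrt(1 - rho**2))
    return norm * math.exp(-(a**2 - 2 * rho * a * b + b**2) / (2 * (1 - rho**2)))

rho = 0.5
pa = std_bivariate_normal(0.3, 0.7, rho)
pb = std_bivariate_normal(0.7, 0.3, rho)
print(pa, pb)   # equal up to floating-point rounding
```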

Update 2, re your question: When you express the joint distribution, you decide which variable goes on the x-axis and which on the y-axis, so you choose to work either in the $(X,Y)$ or the $(Y,X)$ coordinate system; in that sense, the chosen coordinate system induces an order between $X$ and $Y$ (i.e., which one is the first variable). On the other hand, the expressions $f_{Y|X=x}(y)$ and $f_{X|Y=y}(x)$ don't involve any order: there is no "first" or "second" variable in these expressions. So if you were thinking "$f_{X|Y=y}(x)$ is computed from $f_{X,Y}(X,Y)$, so $f_{Y|X=x}(y)$ should be computed from $f_{Y,X}(Y,X)$", that is not true, because no order is implied in conditional densities.

In principle, you could compute $f_{Y|X=x}(y)$ using $f_{Y,X}(Y,X)$, but then you would be working in two coordinate systems, which is not practical. For example, when you compute both marginal densities from the same $f(X,Y)$, there is a nice geometric illustration of independence: at any point $(a,b)$, the joint density is the product of the two marginal densities, which intersect at a right angle. But if you compute the marginals from different coordinate systems, they live in two different planes!

In general, if you carry around two coordinate systems in your analysis, it becomes ambiguous which variable is $X$ and which is $Y$ in an expression like $f(a,b)$, so you would have to write $f(X=a,Y=b)$ and $f(Y=a,X=b)$ each time, or use a different function, say $g$, for the $Y$-$X$ coordinate system, so that when you write $g(a,b)$ it is clear that $Y=a$ and $X=b$. Alternatively, as you did initially, the subscripts of $f$ in $f_{X,Y}(X,Y)$ and $f_{Y,X}(Y,X)$ indicate which coordinate system we are working in, so they are not both the same $f$. But there is no reason to let all this confusion in; we simply stick to one coordinate system.

Answer (score 3)

First, the statement $P(X, Y)=P(Y, X)$, where $X, Y$ are random variables and $P(X, Y)$ is their joint distribution, is not true. $P(X, Y)$ is a probability measure defined on $\sigma(X) \times \sigma(Y)$, where $\sigma(X), \sigma(Y)$ are the $\sigma$-algebras generated by $X$ and $Y$ respectively, and $\sigma(X) \times \sigma(Y)$ and $\sigma(Y) \times \sigma(X)$ are different sets.

Example: define two discrete random variables, $X$ taking values in $\{0, 1\}$ and $Y$ in $\{0, 1, 2\}$. The joint distribution $P(X, Y)$ is then defined on $\{0, 1\} \times \{0, 1, 2\}$, while $P(Y, X)$ is defined on $\{0, 1, 2\}\times \{0, 1\}$.

Even if $X$ and $Y$ take values in the same set, their joint distribution does not have to be symmetric under swapping of the arguments. Moreover, symmetry of the joint distribution is not guaranteed even for a pair of independent random variables on the same set: consider independent $X$ on $\{0, 1\}$ that is $1$ with probability $1$, and $Y$ on $\{0, 1\}$ that is $0$ with probability $1$. Then the joint distribution is described by the $2\times 2$ matrix $\pmatrix{0 & 0 \\ 1 & 0}$ (rows indexed by the values of $X$, columns by the values of $Y$), which is obviously not symmetric under transposition.
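The answer's degenerate example can be written out directly: the joint pmf matrix of that independent pair is not equal to its transpose.

```python
# Joint pmf of the answer's example: X = 1 with probability 1, Y = 0 with
# probability 1, independent. Rows are values of X in {0, 1}; columns are
# values of Y in {0, 1}.
joint = [[0.0, 0.0],   # P(X=0, Y=0), P(X=0, Y=1)
         [1.0, 0.0]]   # P(X=1, Y=0), P(X=1, Y=1)

# Transpose = the same table with the roles of X and Y swapped.
transpose = [[joint[j][i] for j in range(2)] for i in range(2)]
print(joint == transpose)   # False: the joint pmf is not symmetric
```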

Second, Bayes' theorem uses the symmetry of set intersection, not symmetry of the joint distribution, and its statement is about probabilities of events. Given events $A, B$, $P(A|B)= \frac{P(B|A)P(A)}{P(B)}$. The theorem is proved by noting that, since $P(A|B)=\frac{P(A\cap B)}{P(B)}$, $$P(A|B)P(B) = \frac{P(A\cap B)}{P(B)} P(B) = P(A \cap B) = \frac{P(A \cap B)}{P(A)}P(A) = P(B|A)P(A).$$

You can apply the theorem to the joint distribution $P(X, Y)$ to get $$P(X=x\mid Y=y)=\frac{P(Y=y\mid X=x)\,P(X=x)}{P(Y=y)},$$ where the conditional probability is defined via the joint measure, $$P(Y=y\mid X=x)=\frac{P(X, Y)\big[\{X=x\} \cap \{Y=y\}\big]}{P(X)\big[\{X=x\}\big]},$$ and $P(X)$, $P(Y)$ are the marginal distributions, obtained by summing (or integrating) $P(X, Y)$ over the other argument.

Answer (score 0)

I now see where I misunderstood things and what the correct way to think about this is. First, in the two-dimensional case, say we have the random vector $(X_1, X_2)$. This vector is specifically ordered by the indices. So in the pdf $f_{1,2}(x,y)$, where I suppress $X$ in the subscripts of $f$ for convenience, $x$ corresponds to the value of the first coordinate $X_1$ and $y$ to the second coordinate $X_2$.

Because of this, we have $f_{X_2|X_1=x}(y)=f_{1,2}(x,y)/f_1(x)$ and not $f_{X_2|X_1=x}(y)=f_{2,1}(x,y)/f_1(x)$: by conditioning on $X_1=x$ we must have $X_2=y$, and the probability density that encodes this is precisely $f_{1,2}(x,y)$. In contrast, $f_{2,1}(x,y)$ means that $X_2=x$ and $X_1=y$, which contradicts the conditioning. So in essence, the order is predetermined once we consider the random vector $(X_1,X_2)$.
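A discrete sketch of this point (the pmf values are an arbitrary illustration): the conditional pmf of $X_2$ given $X_1=x$ must look up the joint with $x$ in the first slot; swapping the slots conditions on the wrong coordinate.

```python
# f12[(x, y)] = P(X1 = x, X2 = y); arbitrary example values.
f12 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def f1(x):
    # Marginal of X1: sum over the second slot.
    return sum(p for (a, _), p in f12.items() if a == x)

def f2_given_1(y, x):
    # P(X2 = y | X1 = x) = f12(x, y) / f1(x): x goes in the FIRST slot.
    return f12[(x, y)] / f1(x)

print(f2_given_1(1, 0))     # 0.3 / 0.4 = 0.75 (up to rounding)
# Putting the arguments in the wrong order would condition on X2 instead:
print(f12[(1, 0)] / f1(0))  # 0.2 / 0.4 = 0.5, which is NOT P(X2=1 | X1=0)
```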

Therefore, we can generalize this to the $n$-dimensional case. For instance, in the $3$-dimensional case, given $(X_1,X_2,X_3)$, we can decompose the joint density as $f_{1,2,3}(x,y,z)=f_1(x)f_{2|1=x}(y)f_{3|1=x,2=y}(z)$, but we can also permute $X_1,X_2,X_3$ in the decomposition in any order, e.g., $f_2(y)f_{3|2=y}(z)f_{1|2=y,3=z}(x)$. This works because the order of the vector is fixed: $x$ corresponds to the value of the first coordinate $X_1$, $y$ to $X_2$, $z$ to $X_3$, and the joint density of these values is represented by $f_{1,2,3}(x,y,z)$.
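The chain-rule decompositions above can be checked numerically (assuming an arbitrary joint pmf over $\{0,1\}^3$ as the example): every conditioning order reconstructs the same joint value.

```python
import itertools

# Build an arbitrary normalized joint pmf f_123 over {0,1}^3.
raw = {(a, b, c): a + 2*b + 3*c + 1 for a, b, c in itertools.product([0, 1], repeat=3)}
Z = sum(raw.values())
f123 = {k: v / Z for k, v in raw.items()}

def marginal(keep):
    # Marginal pmf over the coordinates listed in `keep` (0-indexed).
    out = {}
    for point, p in f123.items():
        key = tuple(point[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

m1, m2 = marginal([0]), marginal([1])
m12, m23 = marginal([0, 1]), marginal([1, 2])

x, y, z = 1, 0, 1
# Order 1: f_1(x) * f_{2|1=x}(y) * f_{3|1=x,2=y}(z)
order_a = m1[(x,)] * (m12[(x, y)] / m1[(x,)]) * (f123[(x, y, z)] / m12[(x, y)])
# Order 2: f_2(y) * f_{3|2=y}(z) * f_{1|2=y,3=z}(x)
order_b = m2[(y,)] * (m23[(y, z)] / m2[(y,)]) * (f123[(x, y, z)] / m23[(y, z)])
print(order_a, order_b, f123[(x, y, z)])   # all three agree up to rounding
```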