In what scenarios does one have a closed form equation for the Pearson Correlation - especially s.t. changing parameters changes correlation?

I want to understand when I can create distributions and random variables (e.g. functions) such that changing the parameters (or distributions) that specify the distribution (or the parameters of the functions that output random variables) changes the correlation. A closed-form equation for the correlation would be ideal, for example, so that it is clear how changing a parameter of the distribution changes the correlation.

So we have:

$$ corr(X,Y) = \mathbb E_{X, Y \sim p(x, y \mid \theta_X, \theta_Y)}[ Z_X Z_Y ]$$

where $Z_X, Z_Y$ are the standardized r.v.s (e.g. $(X - \mu_X) / \sigma_X$). My hope is to find conditions under which the above varies smoothly, or under which I can control how it varies based on the functions for the random variables and the distributions I choose. I know that the above integral can be complicated - especially in closed form, based on a question I previously asked that required Lebesgue integration - which is why I am asking here. It would also be nice to somehow keep the distributions $p(x \mid \theta_X), p(y \mid \theta_Y)$ separate but correlated, e.g. they are both normal with similar means, or their support sets are similar, or something. The reason for this is that $X, Y$ are in fact two "tasks" that are different but correlated, e.g. classifying handwritten digits vs. classifying language characters. The tasks can be sampled separately, but are in fact highly correlated. Thus, in a way, you can think of $Z_X, Z_Y$ as standardized (regression) predictions of a model on two related tasks.

How does one define (adjustable) random variables and distributions such that one can vary the correlation smoothly?


Current attempt (inspired by MIT's lecture L12.10, RES.6-012):

Define $X = Z + X'$ and $Y = Z + Y'$ where $Z, X', Y'$ are independent and centered. The r.v.s are correlated through the shared component $Z$. Since variances of independent summands add ($\sigma^2_X = \sigma^2_Z + \sigma^2_{X'}$, and likewise for $Y$), the correlation in closed form is:

$$ corr(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y} = \frac{\mathbb E[X Y]}{\sigma_X \sigma_Y} = \frac{\sigma^2_Z}{\sigma_X \sigma_Y} = \frac{\sigma^2_Z}{\sqrt{\sigma^2_{X'} + \sigma^2_Z}\,\sqrt{\sigma^2_{Y'} + \sigma^2_Z}}$$

which makes me wonder if this is really what I am looking for... which feels weird, because I am not even "sampling" two tasks as I described above.
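As a quick numerical sanity check of this construction (a sketch; NumPy and Gaussian components are my assumptions - any centered independent components would do), note that for independent summands variances add, so $\sigma_X = \sqrt{\sigma^2_Z + \sigma^2_{X'}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000
s_z, s_x, s_y = 1.5, 1.0, 2.0  # hypothetical standard deviations

# Independent, centered components; Z is the shared factor.
z = rng.normal(0.0, s_z, n)
x = z + rng.normal(0.0, s_x, n)   # X = Z + X'
y = z + rng.normal(0.0, s_y, n)   # Y = Z + Y'

# Closed form: corr = s_z^2 / (sqrt(s_z^2 + s_x^2) * sqrt(s_z^2 + s_y^2)),
# since variances of independent summands add.
closed_form = s_z**2 / np.sqrt((s_z**2 + s_x**2) * (s_z**2 + s_y**2))
empirical = np.corrcoef(x, y)[0, 1]
```

Increasing `s_z` relative to `s_x` and `s_y` drives the correlation smoothly toward 1, which is exactly the adjustable behavior asked for.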

Also, this made me realize that whatever distributions and r.v.s/functions I choose, the values will be measured relative to their standard deviations. So the closed-form equation depends on the standard deviations, which come from the closed form of the distribution I choose.

Similarly, it made me realize that the crucial thing for getting a closed-form equation is computing the covariance in closed form - or computing $\mathbb E[Z_X Z_Y]$ (or $\mathbb E[X Y]$), i.e., a closed form for the integral/expectation of a product of random variables.

There are 2 best solutions below


What you're looking for is possibly the concept of copula. Copulas make multivariate modelling very flexible, especially in terms of correlation/dependence. A copula $C$ is defined as a joint distribution of two (or more, but I'll keep it to $d=2$) uniform random variables $U,V \sim \textrm{U}[0,1]$. A famous theorem due to Abe Sklar states that $(1)$ for random variables $X\sim F_X,Y \sim F_Y,(X,Y)\sim F_{X,Y}$ $$F_{X,Y}(x,y)=C(F_X(x),F_Y(y))$$ for some copula $C$ (it is an existence statement). Continuity of the rvs makes the copula unique. $(2)$ given marginals $F_X,F_Y$ and some copula $C(u,v)$, then $C(F_X(x),F_Y(y))$ is a valid joint df with marginals $F_X,F_Y$. This is a very powerful result: you can model marginals and joint separately.

There is a plethora of copulas, in fact you can make up your own, with their own parameters, using any decreasing, continuous, convex $\psi:\mathbb{R}^+\to [0,1],\,\psi(0)=1$ and $\psi(x)\to0$ for $x \to \infty$. This class of functions is called Archimedean copula generators. Some of the copulas are particularly famous because they have closed form solutions for rank correlations (a more sophisticated form of correlation if compared to $\rho$) in terms of the copula parameters (an example is the Gumbel copula). Rank correlations can be computed as integrals involving the copula of $X,Y$, and don't have the attainability restrictions of $\rho$.
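A minimal sketch of Sklar's theorem in action, assuming SciPy. The answer's example of a closed-form rank correlation is the Gumbel copula; here a Gaussian copula (parameter $\rho$, an arbitrary choice) is used instead because it is easy to sample, and it has the closed form $\tau = \frac{2}{\pi}\arcsin\rho$ for Kendall's tau - which, being rank-based, is unchanged by the choice of marginals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.7  # Gaussian-copula parameter (arbitrary choice)
n = 100_000

# Sample from a Gaussian copula: correlated normals pushed through Phi
# become dependent uniforms (U, V).
cov = np.array([[1.0, rho], [rho, 1.0]])
w = rng.multivariate_normal([0.0, 0.0], cov, size=n)
u, v = stats.norm.cdf(w[:, 0]), stats.norm.cdf(w[:, 1])

# Attach arbitrary marginals via inverse CDFs; the copula (dependence)
# is unchanged -- this is the "model marginals and joint separately" point.
x = stats.expon.ppf(u)      # X ~ Exponential(1)
y = stats.t.ppf(v, df=4)    # Y ~ Student-t with 4 degrees of freedom

# Kendall's tau has a closed form in the copula parameter.
tau_closed_form = (2.0 / np.pi) * np.arcsin(rho)
tau_empirical, _ = stats.kendalltau(x, y)
```

Varying `rho` from $-1$ to $1$ sweeps the rank correlation smoothly, regardless of how skewed or heavy-tailed the chosen marginals are.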

A good introductory reference for all this is Alexander J. McNeil, Rudiger Frey and Paul Embrechts (2005) "Quantitative Risk Management: Concepts, Techniques, and Tools".


The simplest and most straightforward solution for your problem is to first standardize your pair of random variables $(X, Y)$ and then model their distribution as $N_2(\mathbf{0}_2, \Sigma)$, where $\mathbf{0}_2$ is the vector $(0,0)$ and \begin{align} \Sigma = \begin{bmatrix} 1 &\theta\\ \theta &1 \end{bmatrix} . \end{align} Here, there is only one unknown parameter in the model, i.e., $\theta$, which is the Pearson correlation of $X$ and $Y$. This model is appropriate because you have standardized the observations on $X$ and $Y$, so that the sample means and variances of the standardized observations are $0$ and $1$, respectively. In this case, any other Gaussian model would be inappropriate, because there the means or variances would differ from $0$ and $1$, respectively. So, for a standardized pair of random variables, this is the only appropriate choice of Gaussian model, and it is specified directly in terms of the Pearson correlation of $X$ and $Y$.
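A quick numerical illustration of this model (a sketch; NumPy assumed, and $\theta = 0.6$ is an arbitrary choice) - the single parameter $\theta$ is recovered directly as the empirical Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.6  # the single model parameter: the Pearson correlation

# Sample from N_2(0, Sigma) with unit variances and correlation theta.
sigma = np.array([[1.0, theta], [theta, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], sigma, size=500_000)

empirical_corr = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
```

Sliding `theta` through $(-1, 1)$ varies the correlation smoothly, which is the adjustability the question asks for.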

Copulas would be another approach, but there the model is generally not specified in terms of the Pearson correlation of $X$ and $Y$, and the Pearson correlation may be very difficult to compute.

In scenarios involving skewed or heavy-tailed distributions of $X$ and $Y$, you can also select non-Gaussian models for the standardized random variables, for example the bivariate skew-normal distributions and the bivariate $t$-distributions. However, these introduce additional parameters: for the skew-normal distributions you have to specify the slant parameter, and for the $t$-distribution you need the degrees of freedom.

In your example model, where $X = Z + X'$ and $Y = Z + Y'$, we have $\sigma^2_X = \sigma^2_Z + \sigma^2_{X'}$ and $\sigma^2_Y = \sigma^2_Z + \sigma^2_{Y'}$ (variances of independent summands add), which shows that the random variables are centered but not scaled. You need to standardize them, i.e., both center and scale.
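A small numerical check of these two points (a sketch; NumPy and Gaussian components with $\sigma_Z = 1$, $\sigma_{X'} = 2$, $\sigma_{Y'} = 0.5$ are my assumptions): variances add, and standardizing does not change the Pearson correlation, since correlation is invariant under centering and scaling:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
z = rng.normal(0.0, 1.0, n)
x = z + rng.normal(0.0, 2.0, n)   # X = Z + X'  (sigma_{X'} = 2)
y = z + rng.normal(0.0, 0.5, n)   # Y = Z + Y'  (sigma_{Y'} = 0.5)

# Variances of independent summands add: Var(X) = 1^2 + 2^2 = 5.
var_x = x.var()

# Standardizing (centering AND scaling) leaves the Pearson correlation unchanged.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
corr_raw = np.corrcoef(x, y)[0, 1]
corr_std = np.corrcoef(zx, zy)[0, 1]
```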