How many variables can be pairwise anticorrelated


I am working on a computational project involving analysis of data. Each item of data that I have has a few hundred attributes; I have several million items of data. The attributes are essentially random variables and I am computing the correlation of each pair of variables. The formula to do that is:

$$ \rho_{xy} = \frac{\mathbb{E}( (X-\mu_{x})(Y-\mu_{y}) )}{\sigma_{x}\sigma_{y}}$$

where $X$ and $Y$ are the attributes. This value always lies between $+1$ and $-1$. Intuitively, if $\rho_{xy}\approx +1$ then $X$ and $Y$ typically differ from their respective means in the same way, i.e. if $X$ is above average then so is $Y$ and if $X$ is below average, then so is $Y$. This is called positive correlation. Similarly, if $\rho_{xy}\approx -1$ then $X$ and $Y$ typically differ from their means in opposite directions, i.e. if $X$ is above average, then $Y$ is below and vice versa. This is called negative correlation.
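As a concrete sketch of the formula above (using NumPy and made-up data, since the post doesn't name any tooling), the coefficient for one pair of attributes can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)              # one attribute across 1000 items (synthetic)
y = -x + 0.1 * rng.normal(size=1000)   # built to be strongly anticorrelated with x

# Pearson correlation: E[(X - mu_x)(Y - mu_y)] / (sigma_x * sigma_y)
rho = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
```

Here `rho` lands close to $-1$, matching the "negative correlation" case described above.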

From all of my data, I build a correlation matrix $C=[\rho_{ij}]$ which contains the correlation coefficient of the $i^{\text{th}}$ and $j^{\text{th}}$ variable in the $ij$-entry. I want to view this matrix as an edge-weighted graph and perform clustering. A "positive" clique would correspond to a number of pairwise positively correlated variables. It is clear that many variables can be pairwise positively correlated; so I could potentially see large "positive" cliques in the graph.
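A minimal sketch of building that matrix, again assuming NumPy and synthetic data (with attributes as rows of a data matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(5, 200))   # 5 attributes (rows) x 200 items, synthetic

C = np.corrcoef(data)              # C[i, j] = rho_ij
# C is symmetric with 1s on the diagonal (each variable is
# perfectly correlated with itself), so it can be read directly
# as the adjacency matrix of an edge-weighted graph.
```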

My question is: How many variables could I see that are pairwise negatively correlated? Intuition tells me probably only 2. But I cannot prove this. Basically, I want to define "negative" cliques--a set of nodes all of whose edges are weighted with (nearly) $-1$, and I want to know how large my "negative" cliques could be.

Edit: Perhaps a better way to ask the question is: to what degree can "anti-correlation" be transitive? I.e. if $x$ is (strongly) anti-correlated to $y$, and $y$ is (strongly) anti-correlated to $z$, how strongly can $x$ and $z$ be anti-correlated?

Also, it would be helpful to know how large this "negative" clique could be as a function of how near the edge weights are to $-1$. The idea that the largest negative clique has 2 vertices rests on the assumption that an edge exists only when the weight is exactly $-1$. If I relax that to include an edge whenever the weight is smaller than $-1+\epsilon$ (for small $\epsilon$), then I can probably have slightly more than 2 vertices; how many more is the question (and how that relates to $\epsilon$).

Best answer:

We can simplify the set-up a little by translating the variables to have mean $0$, and scaling them to have standard deviation $1$. If we label the variables $X_1, X_2, ..., X_m$, then we have $\rho_{ij} = \mathbb{E}[X_i X_j]$. Moreover, the normalisation gives $\mathbb{E}[X_i^2] = 1$ for every $i$.

Now suppose you consider a pair of variables to be strongly anticorrelated if $\rho_{ij} \le - \rho$ for some $0 < \rho \le 1$. Moreover, suppose $X_1, X_2, ..., X_k$ form a clique of strongly anticorrelated variables. We shall now bound $k$ in terms of $\rho$. We have $$ 0 \le \mathbb{E} \left[ \left( \sum_{i=1}^k X_i \right)^2 \right] = \mathbb{E} \left[ \sum_{i=1}^k X_i^2 + \sum_{i \neq j} X_i X_j \right].$$

Using linearity of expectation, $$ 0 \le \mathbb{E} \left[ \sum_{i=1}^k X_i^2 + \sum_{i \neq j} X_i X_j \right] = \sum_{i=1}^k \mathbb{E} \left[ X_i^2 \right] + \sum_{i \neq j} \mathbb{E} \left[ X_i X_j \right]. $$

We have $\mathbb{E} [X_i^2] = 1$ for all $i$, and by assumption $\mathbb{E}[X_i X_j] \le - \rho$, so $$ 0 \le \sum_{i=1}^k \mathbb{E} \left[ X_i^2 \right] + \sum_{i \neq j} \mathbb{E} \left[ X_i X_j \right] \le k - k(k-1)\rho.$$

In particular, $k(k-1)\rho \le k$, which can be solved to give $$ k \le 1 + \frac{1}{\rho}. $$

Hence, if $\rho$ is close to $1$, then as you suspected, you cannot have three pairwise-strongly-anticorrelated variables. In order to have such a set of three variables, you need $\rho \le \frac12$, as suggested in Jyrki's comment. More generally, I believe the following set of variables with $k$ attributes (additional attributes can be set to $0$ if needed) shows the bound is sharp: $$ X_1 = \begin{pmatrix} k-1 \\ -1 \\ -1 \\ \vdots \\ -1 \end{pmatrix}, X_2 = \begin{pmatrix} -1 \\ k-1 \\ -1 \\ \vdots \\ -1 \end{pmatrix}, ..., X_k = \begin{pmatrix} -1 \\ -1 \\ -1 \\ \vdots \\ k-1 \end{pmatrix}. $$ Here we have $k$ variables with $\rho_{ij} = \frac{-1}{k-1}$ for all $i \neq j$. Thus $\rho = \frac{1}{k-1}$, and the bound shows we cannot have more than $k$ such variables.
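This construction is easy to check numerically; a small sketch (assuming NumPy, with $k=5$ and the $k$ variables as rows of a matrix):

```python
import numpy as np

k = 5
# The extremal construction: X_i takes the value k-1 in coordinate i
# and -1 in the other k-1 coordinates.
X = k * np.eye(k) - 1              # rows are the k variables

C = np.corrcoef(X)
off_diag = C[~np.eye(k, dtype=bool)]
# Every off-diagonal correlation equals -1/(k-1), so with
# rho = 1/(k-1) the bound k <= 1 + 1/rho is attained exactly.
```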