Given probabilities $p_1, \ldots, p_n$ with a correlation structure, how can we convert these to binary values while retaining the structure?


Suppose we have probabilities $p_1, \ldots, p_n$ with a correlation structure. This correlation structure could have been established with a Gaussian copula. I am wondering how we can convert these to binary values while retaining the structure. I have tried things like:

rbinom(n, 1, p)

in R, where p is the vector of the probabilities above. However, this seems to consistently underestimate the correlation, so the correlation structure is lost.
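A minimal sketch of the phenomenon (in Python with numpy/scipy rather than R; the correlation strength `rho = 0.8`, the sample size, and the seed are illustrative choices, not from the original post). Correlated probabilities are built with a Gaussian copula, then each binary value is drawn independently given its probability, exactly as `rbinom(n, 1, p)` does:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Gaussian copula: correlated standard normals pushed through the normal CDF.
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), cov, size=200_000)
P = norm.cdf(Z)  # each column is Uniform(0, 1); the columns are correlated

# Independent Bernoulli draws given the probabilities (the rbinom analogue).
X = (rng.random(P.shape) < P).astype(float)

cor_P = np.corrcoef(P[:, 0], P[:, 1])[0, 1]
cor_X = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(cor_P, cor_X)  # the binary correlation is noticeably smaller
```

Running this shows the binary vector's correlation is far below the probabilities' correlation, which is the attenuation the question describes.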

Does anyone have any ideas about how to convert the probabilities into a binary vector of length $n$ without losing the correlations? Thanks.


There is 1 best solution below

On BEST ANSWER

In my answer, I am assuming that your intention is to find variables $X_1,\ldots, X_n \in \{0,1\}$ such that

$$\text{Cor}(X_i,X_j) = \text{Cor}(P_i,P_j),$$

where $P_1,\ldots, P_n \in [0,1]$ is some dependent set of variables, and $\text{Cor}$ denotes correlation. That is, I focus on pairwise correlation, though I am sure the argument generalizes to broader notions of dependence.

I will focus on why your current approach underestimates the correlation.


Why your current approach underestimates the correlation

I thought it would be useful first to point out why drawing independent Bernoulli variables underrepresents the correlation. For this we will need the law of total covariance, as well as the law of total variance (since correlation is the ratio of a covariance to a product of standard deviations).

The law of total covariance says

\begin{align*} \text{Cov}(X_i,X_j) & = \mathbf{E} \left[ \text{Cov}(X_i,X_j \, | \, P_i, P_j) \right] + \text{Cov}\left( \mathbf E[X_i\,|\,P_i], \, \mathbf E[X_j\,|\,P_j] \right) \end{align*} Since we are assuming that the $X_i \sim \text{Ber}(P_i)$ are independent conditional on the values of the $P_i$, the first term above is $0$, because the covariance of independent variables is $0$. For the second term, note that the conditional expectation $\mathbf E[X_i \, | \, P_i]$ is exactly $P_i$ (since the expectation of a $\text{Ber}(p)$ variable is $p$). Hence

\begin{align*} \text{Cov}(X_i,X_j) & = \mathbf{E} \left[ \text{Cov}(X_i,X_j \, | \, P_i, P_j) \right] + \text{Cov}\left( \mathbf E[X_i\,|\,P_i], \, \mathbf E[X_j\,|\,P_j] \right) \\ & = 0 + \text{Cov}(P_i, P_j) \\ & = \text{Cov}(P_i, P_j) \end{align*}
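This covariance identity can be checked by simulation. A hedged Python sketch (numpy/scipy; the value `rho = 0.6`, the sample size, and the seed are illustrative choices): the sample covariance of the independent Bernoulli draws should match the sample covariance of the underlying probabilities, up to Monte Carlo noise.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Correlated probabilities via a Gaussian copula.
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), cov, size=500_000)
P = norm.cdf(Z)

# Independent Bernoulli draws conditional on P.
X = (rng.random(P.shape) < P).astype(float)

cov_P = np.cov(P[:, 0], P[:, 1])[0, 1]
cov_X = np.cov(X[:, 0], X[:, 1])[0, 1]
print(cov_P, cov_X)  # the two covariances agree, up to Monte Carlo noise
```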

Now we calculate the variance (and hence standard deviation) of the $X_i$. As before we condition on the value taken by $P_i$ and use

\begin{align*} \text{Var}(X_i) &= \mathbf E \left[ \text{Var}(X_i \, | \, P_i) \right] + \text{Var}\left( \mathbf E[ X_i \, | \, P_i] \right) \\ & = \mathbf{E} \left[ P_i(1-P_i)\right] + \text{Var}(P_i) \\ & = \mathbf E[P_i] - \mathbf E[P_i^2] + \mathbf E[P_i^2] - \mathbf E[P_i]^2 \\ & = \mathbf E[P_i]\left(1 - \mathbf E[P_i]\right) \end{align*} where we used the same manipulations of conditional expectations as before. Note in particular that the second line is a sum of two nonnegative terms, one of which is $\text{Var}(P_i)$, so $$ \text{Var}(X_i) \geq \text{Var}(P_i),$$ from which it follows that $\sigma_{X_i} \geq \sigma_{P_i}$, and hence $$\text{Cor}(X_i,X_j) = \frac{\text{Cov}(X_i,X_j)}{\sigma_{X_i} \sigma_{X_j} } = \frac{\text{Cov}(P_i,P_j)}{\sigma_{X_i} \sigma_{X_j} } \leq \frac{\text{Cov}(P_i,P_j)}{\sigma_{P_i} \sigma_{P_j} } = \text{Cor}(P_i,P_j).$$
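The whole argument can be verified numerically in one go. A hedged Python sketch (numpy/scipy; `rho = 0.6`, the sample size, and the seed are illustrative choices): the sample variance of $X_i$ should match $\mathbf E[P_i](1-\mathbf E[P_i])$ and exceed the variance of $P_i$, and the binary correlation should come out below the probabilities' correlation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Correlated probabilities via a Gaussian copula, as before.
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), cov, size=500_000)
P = norm.cdf(Z)
X = (rng.random(P.shape) < P).astype(float)

# Var(X_i) should equal E[P_i](1 - E[P_i]) and exceed Var(P_i).
var_P = P[:, 0].var()
var_X = X[:, 0].var()
p_mean = P[:, 0].mean()
print(var_X, p_mean * (1 - p_mean), var_P)

# And the attenuation: Cor(X_i, X_j) <= Cor(P_i, P_j).
cor_P = np.corrcoef(P[:, 0], P[:, 1])[0, 1]
cor_X = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(cor_X, cor_P)
```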