Does a data-dependent sampling rule induce correlation?


I'm struggling to understand whether a data stream sliced up in a certain way could produce two quantities that are dependent but uncorrelated.

Suppose I have two iid streams of data that are independent of each other: $X = (X_1, X_2, \ldots)$ and $Y = (Y_1, Y_2, \ldots)$. I want to estimate the difference in means between the two groups. Between the two streams, I want to sample a total of $n$ points. For notation's sake, say I'm sampling one point per unit of time for $T$ total time units.

Now consider the following sampling scheme which divides up the $T$ time period into two halves:

  • Up until time $t = T/2$, sample from $X$ and $Y$ with equal probability.
  • From $t = (T/2+1)$ until $T$, sample from $X$ with probability $p$ and from $Y$ with probability $1-p$, where $p$ is some function of the data I observed in the first half and also $p \in (0,1)$.

Now consider $\hat{\theta}_1 := \bar{X}_1 - \bar{Y}_1$, the difference in sample means calculated from only the data collected up until time $t=T/2$ and $\hat{\theta}_2 := \bar{X}_2 - \bar{Y}_2$ calculated from only the data collected from time $t=(T/2+1)$ to $t=T$.

Question: Without knowing more about how $p$ depends on the data in the first half, can we tell whether $\hat{\theta}_1$ and $\hat{\theta}_2$ are correlated?

Obviously, $\hat{\theta}_1$ and $\hat{\theta}_2$ are not independent, but I nevertheless thought they would be uncorrelated. My reasoning was that the dependence through $p$ only affects the allocation between $X$ and $Y$, and doesn't bias the expected value of $\bar{X} - \bar{Y}$. I suspect I've oversimplified this, but I'm stuck on how to work it out rigorously.

EDIT: As an answer below pointed out, this problem may be more interesting if we restrict $p \in (0,1)$. Or to put it another way, if we require $\bar{X}_1, \bar{X}_2, \bar{Y}_1$ and $\bar{Y}_2$ to all have nonzero probability of containing points. Edit made above.

There are 3 answers below.

Accepted answer

So I thought about my own question for a bit and realized that the answer is totally straightforward.

If $p$ is bounded away from $0$ and $1$, then $E(\hat{\theta}_2 \mid \hat{\theta}_1) = E(\hat{\theta}_2)$: conditional on the first-half data, the second-half observations are still i.i.d. draws from the same two distributions, and the first half only influences how many of them come from each stream, not their values. Then we can simply calculate

$$ E(\hat{\theta}_1 \cdot \hat{\theta}_2) = E\big(E(\hat{\theta}_1 \cdot \hat{\theta}_2 \mid \hat{\theta}_1)\big) = E\big(\hat{\theta}_1 \cdot E(\hat{\theta}_2 \mid \hat{\theta}_1)\big) = E(\hat{\theta}_1) \cdot E(\hat{\theta}_2) $$

And therefore the covariance between $\hat{\theta}_1$ and $\hat{\theta}_2$ is zero.
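The conditional-expectation argument is easy to check numerically. Below is a minimal Monte Carlo sketch in Python/NumPy; the distributions and the particular data-dependent rule for $p$ (squashed into $[0.3, 0.7]$, so bounded away from $0$ and $1$) are my own illustrative choices, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_x, mu_y = 1.0, -0.5   # illustrative means; any values work
n_half = 30              # draws per half, i.e. T = 60
reps = 20_000

theta1 = np.empty(reps)
theta2 = np.empty(reps)
kept = 0
for _ in range(reps):
    # First half: sample from X and Y with equal probability.
    from_x = rng.random(n_half) < 0.5
    x1 = rng.normal(mu_x, 1.0, from_x.sum())
    y1 = rng.normal(mu_y, 1.0, n_half - from_x.sum())
    if x1.size == 0 or y1.size == 0:
        continue  # skip the (very rare) rep with an empty group
    t1 = x1.mean() - y1.mean()

    # Second half: p depends on the first-half data but stays in
    # [0.3, 0.7], i.e. bounded away from 0 and 1.
    p = 0.3 + 0.4 / (1.0 + np.exp(-t1))
    from_x = rng.random(n_half) < p
    x2 = rng.normal(mu_x, 1.0, from_x.sum())
    y2 = rng.normal(mu_y, 1.0, n_half - from_x.sum())
    if x2.size == 0 or y2.size == 0:
        continue
    theta1[kept] = t1
    theta2[kept] = x2.mean() - y2.mean()
    kept += 1

corr = np.corrcoef(theta1[:kept], theta2[:kept])[0, 1]
print(f"kept={kept}, corr={corr:.4f}")
```

The empirical correlation comes out within Monte Carlo noise of zero, matching the argument above.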

Answer 2

Without knowing more about how $p$ depends on the data in the first half we can't tell much of anything. The two variables $\hat{\theta}_1$ and $\hat{\theta}_2$ might be correlated, and they might not. They might be independent, and they might not (though no random variables can be both independent and correlated).

  • It's perfectly reasonable for our method of selecting $p$ to always set $p=\frac12$. Deterministic choices are a special case of random ones, and they're a nice source of examples and counter-examples in probability and statistics. In this case, we clearly have independence of the variables, and since we have independence they aren't correlated.

    Update: If one wanted a stochastic choice of $p$, that would be fine too here. Any stochastic choice of $p$ that doesn't depend on $\hat{\theta}_1$ would have the same effect. For more fun, with carefully chosen $X_i$ and $Y_i$ you can actually make it so that $\hat{\theta}_1$ and $\hat{\theta}_2$ are independent even when $p$ explicitly depends on $\hat{\theta}_1$. One example: a stream of length $2$, with all the $X_i$ i.i.d. uniform on $[0,1]$ and all the $Y_i$ i.i.d. uniform on $[-1,0]$. Whichever stream the second point comes from, its contribution to $\hat{\theta}_2$ is $U(0,1)$ (either $X_2$ or $-Y_2$), so the distribution of $\hat{\theta}_2$ doesn't depend on $p$ at all; relatedly, the conditional probability that the first point was drawn from $X$ given any observed value of $\hat{\theta}_1$ is always $\frac12$. That suffices to give independence.

  • In other examples, it's feasible to have dependence between the variables. Suppose the $X_i$ and $Y_i$ are all i.i.d., and in fact just suppose they're uniform on $[0,1]$ (denote this $U(0,1)$). Our data stream is going to have two elements, so that $\hat{\theta}_1$ is based only on the first element and $\hat{\theta}_2$ solely on the second. The rule we're going to use for selecting $p$ is that $p=1$ if $\hat{\theta}_1>0$ and $p=0$ otherwise. In this extremely contrived example (and in more complicated examples too...just making the point without muddying the waters), we'll be able to show the variables are correlated.

    In particular, note that $\hat{\theta}_1$ is actually just $U(-1,1)$, and $\hat{\theta}_2$ is actually just $\text{sgn}(\hat{\theta}_1)\,U(0,1)$, where the signum function is equal to $-1$ for negative values and $1$ for positive values, and where we really don't care about its value at $0$ because the set $\{0\}$ has measure $0$ and doesn't affect our probabilities. Sparing you most of the busywork (it's a good exercise): the covariance is $E[|\hat{\theta}_1|]\cdot E[U(0,1)] = \frac12\cdot\frac12 = \frac14$, both variances are $\frac13$, and so the population Pearson correlation coefficient between $\hat{\theta}_1$ and $\hat{\theta}_2$ in this case is $\frac34$ -- which we sort of expected since $\hat{\theta}_2$ is non-negative when $\hat{\theta}_1$ is non-negative and $\hat{\theta}_2$ is non-positive when $\hat{\theta}_1$ is non-positive.

    Update: Choosing $p=0,1$ was just a tool to make computing the correlations easy. The correlation of $\frac34$ we observed varies smoothly as a bivariate function of our two separate choices of $p$ ($1$ when $\hat{\theta}_1$ is positive and $0$ otherwise) and can take on any value from $-\frac34$ to $\frac34$. Except for special choices of $p$ (I think the correlation only vanishes when $p$ is the same for all values of $\hat{\theta}_1$), this correlation is still non-zero.

  • Even when we have a dependence between the variables, it's perfectly feasible for them to not be correlated with suitably chosen parameters. Keep the 2-item data stream from the last example. Let the $X_i$ be i.i.d uniform on the set $[-1,1]$, and let the $Y_i$ be i.i.d uniform on the set $[-2,-1]\cup[1,2]$. Select $p=1$ whenever $\hat{\theta}_1\in[-1,1]$ and $p=0$ otherwise.

    This has dependence just like in the last example for basically the same reason. The main difference is in the symmetry of the sets in question and in how we select $p$. When we sample $\hat{\theta}_1$ and $\hat{\theta}_2$, the result is either two i.i.d samples from $U(-1,1)$ or two i.i.d samples from the uniform distribution on $[-2,-1]\cup[1,2]$. Using that fact we can quickly compute that the population Pearson correlation coefficient between these two random variables is $0$.

    Update: As in the last example, the reliance on crisp ($0$-or-$1$) selections of $p$ was just out of laziness. It turns out that any selection of $p$ in this problem still leaves you with a population Pearson correlation coefficient of $0$: the $X_i$ and the $-Y_i$ all have mean $0$, so $E(\hat{\theta}_2 \mid \hat{\theta}_1) = 0$ no matter how $p$ depends on $\hat{\theta}_1$. The crisp selections just make this easier to see, since then the pair is strictly two i.i.d. draws from a distribution that is symmetric around $0$.
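Both the correlated construction and the symmetric zero-correlation construction above are easy to check by simulation. A minimal Python/NumPy sketch (variable names and sample sizes are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Correlated construction: theta1 ~ U(-1,1) and
# theta2 = sgn(theta1) * U(0,1), i.e. p = 1 when theta1 > 0, else 0.
t1 = rng.uniform(-1.0, 1.0, n)
t2 = np.sign(t1) * rng.uniform(0.0, 1.0, n)
corr_dep = np.corrcoef(t1, t2)[0, 1]

# Symmetric construction: the pair (theta1, theta2) is either two i.i.d.
# U(-1,1) draws or two i.i.d. draws from the uniform distribution on
# [-2,-1] u [1,2], each regime with probability 1/2.
inner = rng.random(n) < 0.5
u = rng.uniform(-1.0, 1.0, (n, 2))
signs = np.where(rng.random((n, 2)) < 0.5, -1.0, 1.0)
v = signs * rng.uniform(1.0, 2.0, (n, 2))
pair = np.where(inner[:, None], u, v)
corr_sym = np.corrcoef(pair[:, 0], pair[:, 1])[0, 1]

print(f"dependent example: {corr_dep:.3f}, symmetric example: {corr_sym:.3f}")
```

The first estimate comes out clearly positive, while the second sits within Monte Carlo noise of zero.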

Answer 3

If it is guaranteed that the samples are large enough that at least one $X$ and one $Y$ is chosen in each half, the covariance $\operatorname{Cov}(\hat{\theta}_1,\hat{\theta}_2)$ is shown below to be zero, so these random variables are not correlated.

$$E[\hat{\theta}_1]=E[\bar{X}_1]-E[\bar{Y}_1]=\mu_x-\mu_y$$

since, conditioning on the (random) count $n_{x1}$, $$E[\bar{X}_1]= E\left[\frac{1}{n_{x1}}\sum_{i=1}^{n_{x1}}X_{1i}\right]=E\left[\frac{n_{x1}}{n_{x1}}\,\mu_x\right]=\mu_x$$

The fact that $n_{x1}$ is an outcome of a random variable does not change the calculation, provided $P(n_{x1}=0)=0$.

Likewise
$$E[\hat{\theta}_1\hat{\theta}_2]=E[\bar{X}_1\bar{X}_2-\bar{X}_1\bar{Y}_2-\bar{Y}_1\bar{X}_2+\bar{Y}_1\bar{Y}_2]=\mu_x^2-2\mu_x\mu_y+\mu_y^2$$

since, conditional on the first-half data, each second-half sample mean still has expectation $\mu_x$ (or $\mu_y$), and the first-half means are functions of that data.

So,

$$\operatorname{Cov}(\hat{\theta}_1,\hat{\theta}_2)=E[\hat{\theta}_1\hat{\theta}_2]-E[\hat{\theta}_1]E[\hat{\theta}_2]=0$$

It is not clear if there is interest in including the situation where a sample contains no entries from $X$ or none from $Y$. It is only because of that possibility that covariance would be introduced between $\hat{\theta}_1$ and $\hat{\theta}_2$.

Assume that $p_{x1}$ is the probability that there is at least one $X$ in sample 1 (and define an empty sample mean to be $0$); then

$$E[\hat{\theta}_1]=E[\bar{X}_1]-E[\bar{Y}_1]=p_{x1}\mu_x-p_{y1}\mu_y$$

but in general

$$E[\bar{X}_1\bar{X}_2]\ne p_{x1}p_{x2}\mu_x^2$$

so the covariance would not be zero in the general case.
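This last point can be illustrated numerically. The sketch below uses one draw per half, an empty group's sample mean defined to be $0$, and distributions and a data-dependent rule of my own choosing; none of these specifics come from the thread.

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 200_000
mu_x, mu_y = 1.0, 0.0   # illustrative means

# First half: a single point, drawn from X or Y with equal probability;
# the empty group's sample mean is defined to be 0.
from_x1 = rng.random(reps) < 0.5
theta1 = np.where(from_x1,
                  rng.normal(mu_x, 1.0, reps),    # X-bar_1 - 0
                  -rng.normal(mu_y, 1.0, reps))   # 0 - Y-bar_1

# Second half: an extreme data-dependent rule, p = 1 if theta1 > 0 else 0.
from_x2 = theta1 > 0
theta2 = np.where(from_x2,
                  rng.normal(mu_x, 1.0, reps),
                  -rng.normal(mu_y, 1.0, reps))

cov = np.cov(theta1, theta2)[0, 1]
print(f"cov = {cov:.3f}")  # clearly positive here
```

Here $E[\hat{\theta}_2 \mid \hat{\theta}_1]$ equals $\mu_x$ when $\hat{\theta}_1>0$ and $-\mu_y$ otherwise, so whenever those two values differ the covariance is non-zero.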