Picking a uniformly random subset of colored balls: moments of the discrepancy between colors?


Let $N \gg k\geq 1$, and let $\gamma \in (0,1/2)$ be such that $\gamma N$ is an integer. I have a set of $N$ colored balls (3 colors in total):

  • $\gamma N$ are red;
  • $\gamma N$ are blue;
  • $(1-2\gamma)N$ are green.

I pick a random subset $S$ of $[N]$ of size $k$, and would like to know the expectation and variance of the quantity $$ (X-Y)^2 $$ where $X$ (resp. $Y$) is the number of red (resp. blue) balls. (Ideally, the full distribution of $X-Y$ would be nice, but the above is all I really need, I think).
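For concreteness, here is a minimal Monte Carlo sketch of the setup (Python; the specific values $N=30$, $k=5$, $\gamma=0.2$ and the function name are arbitrary illustration choices):

```python
import random

def sample_discrepancy(N, k, gamma, trials=200_000, seed=0):
    """Estimate E[(X - Y)^2] by drawing size-k subsets without replacement."""
    rng = random.Random(seed)
    r = int(gamma * N)  # number of red balls = number of blue balls
    balls = ['r'] * r + ['b'] * r + ['g'] * (N - 2 * r)
    total = 0
    for _ in range(trials):
        s = rng.sample(balls, k)  # uniform size-k subset, no replacement
        total += (s.count('r') - s.count('b')) ** 2
    return total / trials

print(sample_discrepancy(30, 5, 0.2))
```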

Now, this does feel an awful lot like a variant of balls-and-bins, or basically of something I should know. But every attempt I have made today either relies on (a) the (not very justified) approximation of independence (i.e., sampling with replacement instead of without); or (b) horrible looking sums involving binomial coefficients.

Does the problem above have a name, or a long history that I am missing because of not knowing the right keywords? Or, if not, is there a very simple analysis of the two quantities mentioned above?


BEST ANSWER

Two possible techniques.


First technique: use other people's work. The distribution in question appears to be, from @BGM's comment, a multivariate hypergeometric distribution, which is a very enjoyable name to pronounce very quickly 5 times in a row.

More precisely, $(X,Y,k-X-Y)\sim \operatorname{MultivHypergeom}_3(\underbrace{(\gamma N,\gamma N,(1-2\gamma) N)}_{(K_1,K_2,K_3)}, N, k)$, from which we get

$$ \mathbb{E}X = \mathbb{E}Y = k\frac{\gamma N}{N} = \gamma k \tag{1} $$ $$ \operatorname{Var} X = \operatorname{Var} Y = \frac{\gamma N}{N}\left(1-\frac{\gamma N}{N}\right)k\frac{N-k}{N-1}= k\gamma\left(1-\gamma\right)\frac{N-k}{N-1} \tag{2} $$ and $$ \operatorname{Cov} (X,Y) = -k\frac{K_1K_2}{N^2}\frac{N-k}{N-1}= -k\gamma^2\frac{N-k}{N-1} \tag{3} $$ from which we can compute $$\begin{align} \mathbb{E}[(X-Y)^2] &= \operatorname{Var} (X-Y) + (\mathbb{E}[X-Y])^2 = \operatorname{Var} (X-Y) = \operatorname{Var} X + \operatorname{Var} Y - 2\operatorname{Cov} (X,Y) \\ &= 2\left( \operatorname{Var} X - \operatorname{Cov} (X,Y) \right) = 2\left( k\gamma\left(1-\gamma\right)\frac{N-k}{N-1} + k\gamma^2\frac{N-k}{N-1} \right) \\ &= 2k\gamma \frac{N-k}{N-1}. \end{align}$$ This, intuitively, makes sense: roughly $2k\gamma$, with a correction factor due to the lack of independence. I expect (no pun intended) one can also leverage this approach to obtain the expression of $\mathbb{E}[(X-Y)^4]$.
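As a quick cross-check (my addition, not part of the original answer), this closed form can be verified by exact enumeration of all $\binom{N}{k}$ subsets for small parameters, e.g. in Python:

```python
from itertools import combinations
from fractions import Fraction

def exact_second_moment(N, k, gamma_num, gamma_den):
    """Exact E[(X - Y)^2] by enumerating all size-k subsets of the N balls."""
    r = N * gamma_num // gamma_den  # number of red = number of blue
    balls = ['r'] * r + ['b'] * r + ['g'] * (N - 2 * r)
    total, count = 0, 0
    # combinations() distinguishes positions, so this enumerates all C(N, k)
    # equally likely subsets, duplicated labels and all.
    for s in combinations(balls, k):
        total += (s.count('r') - s.count('b')) ** 2
        count += 1
    return Fraction(total, count)

# N=10, k=3, gamma=1/5 (so 2 red, 2 blue, 6 green)
lhs = exact_second_moment(10, 3, 1, 5)
rhs = Fraction(2 * 3 * 1, 5) * Fraction(10 - 3, 10 - 1)  # 2*k*gamma*(N-k)/(N-1)
print(lhs == rhs)  # True
```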


Second technique: build on someone else's ideas. Basically, @Slowpoke's answer, with the following fix: conditionally on $X+Y$, the count $X$ follows a hypergeometric distribution, not a binomial one. That is, we have $$ X \mid (X+Y=u) \sim \operatorname{Hypergeom}(u, \gamma N, 2\gamma N) $$ where $U\stackrel{\rm def}{=}X+Y\sim \operatorname{Hypergeom}(k, 2\gamma N, N)$, from which we can compute $$ \mathbb{E}[(X-Y)^2] = \mathbb{E}[\mathbb{E}[(X-Y)^2\mid U]] = \mathbb{E}[\mathbb{E}[(2X-U)^2\mid U]] $$ and $$ \mathbb{E}[(X-Y)^4] = \mathbb{E}[\mathbb{E}[(X-Y)^4\mid U]] = \mathbb{E}[\mathbb{E}[(2X-U)^4\mid U]]. $$ This is a bit tedious, but doable; e.g., using Mathematica:

Expectation[ Expectation[(2 X - U)^2, {X \[Distributed] HypergeometricDistribution[U, a*N, 2 a*N]}], {U \[Distributed] HypergeometricDistribution[k, 2*a*N, N]}]
Expectation[ Expectation[(2 X - U)^4, {X \[Distributed] HypergeometricDistribution[U, a*N, 2 a*N]}], {U \[Distributed] HypergeometricDistribution[k, 2*a*N, N]}]

we obtain $$ \begin{align} \mathbb{E}[(X-Y)^2] &= 2k\gamma \frac{N-k}{N-1} \operatorname*{\sim}_{N\to\infty} 2k\gamma \\ \mathbb{E}[(X-Y)^4] &= \frac{2 \gamma k (k - N)\left( N\left(6\gamma(N-1) - N - 1\right) + 6 k^2 (\gamma N-1) - 6 k N (\gamma N-1)\right)}{(N-1)(N-2)(N-3)} \\ &\operatorname*{\sim}_{N\to\infty} 12 \gamma^2 k^2 + 2(1 - 6 \gamma) \gamma k \end{align} $$
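The fourth-moment expression can also be checked exactly against brute-force enumeration for small parameters (my addition; the values $N=10$, $k=4$, $\gamma=1/5$ are arbitrary):

```python
from itertools import combinations
from fractions import Fraction

def exact_fourth_moment(N, k, r):
    """Exact E[(X - Y)^4] over all size-k subsets; r = #red = #blue."""
    balls = ['r'] * r + ['b'] * r + ['g'] * (N - 2 * r)
    subsets = list(combinations(balls, k))
    total = sum((s.count('r') - s.count('b')) ** 4 for s in subsets)
    return Fraction(total, len(subsets))

def formula_fourth_moment(N, k, g):
    """The Mathematica closed form above (g = gamma, as an exact Fraction)."""
    num = 2 * g * k * (k - N) * (N * (6 * g * (N - 1) - N - 1)
                                 + 6 * k**2 * (g * N - 1)
                                 - 6 * k * N * (g * N - 1))
    return num / ((N - 1) * (N - 2) * (N - 3))

# N=10, k=4, gamma=1/5 (so r = 2 red, 2 blue, 6 green)
print(exact_fourth_moment(10, 4, 2) == formula_fourth_moment(10, 4, Fraction(1, 5)))  # True
```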


ANSWER

I've got a strange result for expectation, so feel free to downvote it:

First, let's split the $k$ balls taken into two parts: one with the red and blue balls, and the other with the green balls. This gives the random variable $U$:

$$P(U=u) = \operatorname{Hypergeom}(u;\ k, 2\gamma N, N)$$

which means that, drawing $k$ balls without replacement from the box of $N$ balls that contains $2\gamma N$ red and blue balls, we get exactly $u$ red and blue balls (and $k-u$ green balls).

The second step is to count the red balls ($X$) among the $u$ red and blue balls, which has binomial probability $\frac{1}{2}$ (I suppose we have already accounted for all the different combinations inside the $u$ balls by not separating them in the previous step):

$$P(X=x \mid U=u) = \operatorname{Bin}\!\left(x;\ u, \tfrac{1}{2}\right)$$

Now we want to calculate expectation $$E[(X-Y)^2]= E[(2X-U)^2]=E[E[(2X-U)^2 \ | \ U]]$$

The inner expectation is:

$$E[(2X-U)^2 \ | \ U=u] =4E[X^2 \ | \ U=u]-4uE[X \ | \ U=u] + u^2$$

The expectation of a binomial distribution is $np$ and its variance is $np(1-p)$, so its second moment is $np(1-p) + (np)^2$. Finally, with $n=u$ and $p=\frac{1}{2}$, we get:

$$4E[X^2 \ | \ U=u]-4uE[X \ | \ U=u] + u^2 = 4\frac{u+u^2}{4} - 4u\frac{u}{2} + u^2 = u.$$

(Strange...)
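Under this model the identity is in fact exact: $2X-u$ has mean $0$ and variance $4\cdot u\cdot\frac14 = u$, so $\mathbb{E}[(2X-u)^2]=u$. A quick brute-force check (my addition, in Python) summing over the binomial pmf:

```python
from math import comb
from fractions import Fraction

def binom_second_moment(u):
    """E[(2X - u)^2] for X ~ Bin(u, 1/2), computed by summing the pmf."""
    return sum(Fraction(comb(u, x), 2**u) * (2 * x - u) ** 2
               for x in range(u + 1))

print([int(binom_second_moment(u)) for u in range(6)])  # [0, 1, 2, 3, 4, 5]
```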

Well, it is just a hypergeometric expectation then:

$$E[U] = k \frac{2\gamma N}{N} = k 2\gamma. $$


So I wrote a program in R to check the formula you derived. We have 2 balls of each type and take $k=3$ balls. The program draws samples, computes $val=(X-Y)^2$, and then estimates the expectation as $\dfrac{\sum_{i=1}^{1000000}val_i}{1000000}$:

myset <- c("r", "r", "b", "b", "g", "g")
N <- length(myset)
k <- 3

# Draw k balls without replacement and return (X - Y)^2
sample_from_myset <- function(myset) {
  s <- sample(myset, k)
  (length(which(s == "r")) - length(which(s == "b")))^2
}

# Monte Carlo estimate of E[(X - Y)^2] over 1,000,000 draws
vals <- replicate(1000000, sample_from_myset(myset))
mean(vals)

# Theoretical value: 2 * k * gamma * (N - k) / (N - 1), with gamma = 2/N
2 * k * (2 / N) * (N - k) / (N - 1)

The result, 1.198524, is very close to the theoretical value derived from your formula: $$2 k \gamma \frac{N-k}{N-1} = 2 \cdot 3 \cdot \frac{2}{6} \cdot \frac{6-3}{6-1} = \frac{6}{5} = 1.2$$

Testing with other initial sets and values of $k$ also shows close results. Thus, I suppose that your formula is correct (and mine isn't).