Estimating the expected Size of the smallest of two clusters.

50 Views Asked by At

We have 100 total points, distributed into two clusters (A and B) such that the probability is equally distributed. This means that any point can be the part of any cluster with equal chance.

Now what is the Expected Size of the Smallest of both Clusters. E[min(size of A,size of B)]?

To start off i know from this statement that maximum size of the smallest cluster is 49.

2

There are 2 best solutions below

1
On

This is a rough sketch, using many 'well known' approximations that come in handy here. Let's generalize this problem for any $2k$ elements, so $k=50$ is your case.

The probability that cluster A has exactly $i$ elements ($0 \le i \le 2k$) is $$P(c_A=i) = \frac{2k \choose i}{2^{2k}}$$.

In that case, the size of the smallest cluster is $i$, if $i \le k$ and $2k-i$, if $i > k$. This gives the following expected value

$$E[\min(|A|,|B|)] = \frac{\sum_{i=0}^k i{2k \choose i} + \sum_{i=k+1}^{2k} (2k-i){2k \choose i}}{2^{2k}}$$

Considering ${2k \choose i} = {2k \choose 2k-i}$ the second sum can be rewritten as $$\sum_{i=k+1}^{2k} (2k-i){2k \choose i} = \sum_{i=k+1}^{2k} (2k-i){2k \choose 2k-i} = \sum_{i=0}^{k-1} i{2k \choose i}$$. This gives

$$E[\min(|A|,|B|)] = \frac{k{2k \choose k} + 2\sum_{i=0}^{k-1} i{2k \choose i}}{2^{2k}}$$

Using Combinatorial identity $i{n \choose i } =n {n-1 \choose i - 1}$, we get

$$E[\min(|A|,|B|)] = \frac{k{2k \choose k} + 2\sum_{i=0}^{k-1} (2k){2k-1 \choose i-1} }{2^{2k}} = \frac{k{2k \choose k} + 4k\sum_{i=0}^{k-1} {2k-1 \choose i-1} }{2^{2k}} = \frac{k{2k \choose k} + 4k(1 + \sum_{i=0}^{k-2} {2k-1 \choose i})}{2^{2k}}$$

The single remaining sum is now 'almost' half of the sum $\sum_{i=0}^{2k-1}{2k-1 \choose i} = 2^{2k-1}$. The term missing to get exactly the half is ${2k-1 \choose k-1}$, but this term is of lesser order than the sum, so we can omit here (and also the "+1"):

$$E[\min(|A|,|B|)] \approx \frac{k{2k \choose k} + 4k\frac{2^{2k-1}}{2}}{2^{2k}} = \frac{k{2k \choose k} + k2^{2k}}{2^{2k}}$$.

Again, $2k \choose k$ is of lesser order than $2^{2k}$, so we get finally

$$E[\min(|A|,|B|)] \approx k$$

Of course, this is suprising, as $k$ is the maximal possible value for the minimum, but of course for big $k$ a situation where A and B both get 'roughly half the points' is very, very likely.

Now, using Python to evaluate the exact formula for $k=50$ yields an expected value of $46.0...$ As usual, different problems have different levels of how fast the infinite behaviour happens. Based in the 'omitted' items of the sum, I'd assume that $k - c\sqrt{k}$ for some $c$ would be a better approximation.

2
On

To pick up the ball where Ingix left it, we have

$$ E(\min (|A|, |B|)) = {k {2k \choose k} + 4k (1 + \sum_{i=0}^{k-2} {2k-1 \choose i}) \over 2^{2k}} $$

Let's look at that sum $\sum_{i=0}^{k-2} {2k-1 \choose i}$. Call it $S(k)$. Observe that

$$ 2^{2k-1} = {2k-1 \choose 0} + {2k - 1 \choose 1} + \cdots + {2k - 1 \choose 2k-1}. $$

and by symmetry

$$ 2^{2k-1} = 2 \left( {2k-1 \choose 0} + {2k-1 \choose 1} + \cdots + {2k-1 \choose k-1} \right) = 2 \left( S(k) + {2k-1 \choose k-1} \right). $$

So $S(k) = 2^{2k-2} - {2k-1 \choose k-1}$. Plugging that back in, we get

$$ E(\min (|A|, |B|)) = {k {2k \choose k} + 4k \left( 1 + 2^{2k-2} - {2k-1 \choose k-1} \right) \over 2^{2k}}. $$

Now observe that ${2k \choose k} = 2 {2k-1 \choose k-1}$. So

$$ E(\min (|A|, |B|)) = {k {2k \choose k} + 4k \left( 1 + 2^{2k-2} - {1 \over 2} {2k \choose k } \right) \over 2^{2k}}. $$

We can now simplify a bit to get

$$ E(\min (|A|, |B|)) = {-k {2k \choose k} + 4k + k 2^{2k} \over 2^{2k}} $$

and finally we can get a nice expression for the difference between that expectation and $k$:

$$ k - E(\min (|A|, |B|) = {k {2k \choose k} \over 4^k} - {4k \over 2^{2k}}. $$

At this point we can use the standard approximations ${2k \choose k} \sim 4^k/\sqrt{\pi k}$ (derived, for example, from Stirling's approximation) to get

$$ k - E(\min (|A|, |B|) \sim {k \over \sqrt{\pi k}} = {\sqrt{k \over \pi}}. $$