Counting samples that are different in the given percentage of elements

36 Views Asked by At

Suppose there are n unique elements. A mix is a set of m elements pulled from the elements without replacement, so that there are $ C^{m}_{n} $ possible mixes in total. Suppose one has all such mixes. Let call a mix to be distinct from some others if, at least, 50% of its elements are not present in any other mix (pairwise comparison of mizes).

How many of $ C^{m}_{n} $ mixes are distinct mixes? Or, in other words, let N be a set of mixes. Lets call N a valid mix, if all mixes therein are pairwise distinct. How many mixes are in the largest possible valid set?

I will be happy with rough estimation for large n and m (say, $10^5$ and $10^{4}$ correspondingly), but, preferably, without assuming n>>m. All methods I tried seem to heavily underestimate the number, and counting them by brutal force on computer takes too much time.

1

There are 1 best solutions below

1
On BEST ANSWER

Let the set of mixes be $\{M_1,\dots,M_{\binom{n}{m}}\}$. Each mix-set $S_j$ can contain some subset of these mixes. There are $2^{\binom{n}{m}}$ possible mix-sets $\{S_j\}$. Some of these are valid. We wish to find $max(\enspace|S_j|\enspace \big| S_j \text{ is valid})$.

We can draw a large graph to represent the possible mixes as follows. Let each node represent a mix. We draw an edge from $M_i$ to $M_j$, if they differ by more than 50% of elements.

Now it's easy to see that your problem is to find the size of the largest clique in this graph. Since you are looking for an approximation for large $m,n$, one can find this by assuming a random graph, by

  1. Finding the probability of an edge $p_e$ between any two vertices
  2. Given $p_e$, and number of nodes $\binom{n}{m}$, finding bounds for the largest clique in this random graph.

We can calculate $p_e$ as follows. Fix some node, ie some mix of m elements from n. Then to see nodes that differ from this by half the elements, we can choose $m$ elements from at most $n-\frac{m}{2}$ elements. Thus the probability of an edge is $$p_e = \frac{\binom{n-\frac{m}{2}}{m}}{\binom{n}{m}}$$ Note that this works, irrespective of which fraction of $\frac{m}{2}$ elements we exclude

For the second part, we have the following theorem: For $G\in\mathcal{G}_{n,p}$, for any $p\in (0,1)$, the clique number is almost surely $\sim 2log_{\frac{1}{p}}(n)$

Then as per the theorem, our clique number is \begin{align*} C &= 2log_{\frac{1}{p_e}}\left(\binom{n}{m}\right)\\ &=2\frac{log\left(\binom{n}{m}\right)}{log(\frac{1}{p_e})}\\ &=2\frac{log\left(\binom{n}{m}\right)}{log\left(\frac{\binom{n}{m}}{\binom{n-\frac{m}{2}}{m}}\right)}\\ &=\frac{2}{1-k} \end{align*}

Where $k = log(\binom{n-\frac{m}{2}}{m})/log(\binom{n}{m})$.


There are many ways to approximate k. For example, see here. But in all such cases, one notes that $k\sim 1$ for the kind of $m,n$ values you mention.