Estimating the number of parts in a partition of a set

135 Views Asked by At

In trying to solve a problem for a project I am working on, I thought about a way to use estimation to make one of its components easier. I do not know much about probability since it is not the crux of the project, so I have some questions.

Say I have some set $S$ which is partitioned into $n$ disjoint subsets that span $S$. I do not know this $n$, but I am trying to find it. For a given $s \in S$, I can tell you how many other elements of $S$ are in the same partition as $s$ is. Thus, my idea was to randomly choose elements of $S$ and use the sizes of the relevant partitions to estimate the number of partitions.

First I thought to simply average the sizes of partitions to find the average partition size, and then divide $|S|$ by this average size, but this does not seem to work:

$$S = \{(1,2,3,4,5,6,7),(8,9,10)\}$$

There are clearly two partitions of $S$, but the above method would estimate that there are fewer. Since the first partition is bigger, we are more likely to randomly choose samples from it, so we are inclined to overestimate the average partition size, and so underestimate the number of subsets.

Is there a way we can adjust for the fact that bigger partitions will be overrepresented? Please keep in mind that we do not know a priori the number of partitioning subsets or their sizes.

1

There are 1 best solutions below

0
On BEST ANSWER

This depends on how the partition of $S$ was constructed. For this I’ll assume you create $|S|$ many empty sets (this is the largest $n$ can be) and randomly assign each element of $S$ to them one by one, and at the end discard those that are still empty. You should compare this to the method of first randomly selecting the size of the partition, then assigning elements so that each set in the partition is guaranteed to have at least one element. These will be different on average.

After you’ve randomly selected your elements, let $p$ be the number of sets you’ve accounted for, and $m$ be the number of elements you’ve accounted for (for example, if you only look at 1 element and it is in a set with 10 elements total, you would have $p=1$ and $m=10$). You know that the remaining $|S|-m$ elements can’t be in any of the $p$ sets you’ve found, and you also know that there can be at most $|S|$ sets in the partition. Because elements of $S$ were assigned to elements of the partition independently of each other, we need to calculate the expected number of non-empty sets when $|S|-m$ elements are randomly assigned to $|S|-p$ sets (If you don’t understand why this is, think of it as follows: you’re partway through constructing a partition of $S$ and you’ve already assigned $m$ elements to $p$ sets, and want to assign the rest of the elements to other sets. On average how many more sets will this use?)

To calculate this, we first need to calculate the probability that $k$ sets are nonempty for any given $k\leq |S|-m$. For simplicity, let $u=|S|-p$ (the number of sets left) and let $v=|S|-m$ (the number of elements left).

For $k=1$, the first element has a 100% chance of going into the correct set (because it doesn’t actually matter which set they all go in to), and each element after that has a $1/u$ chance of going into the same set, so overall the chance is $1\cdot(1/u)^{v-1}$.

For $k=2$, the first element again has a 100% chance of being correct, but we need the second element to go in a different set to make sure exactly 2 sets are non-empty, which has a $(u-1)/u$ chance. Each remaining element has a $2/u$ chance to be in one of those sets, so overall the probability is $1\cdot(u-1)/u\cdot(2/u)^{v-2}$.

Continuing like this, we see that for any $k\leq v$, the chance of exactly $k$ sets being non-empty is (after a little rearrangement) $$\frac{(u-1)!}{(u-k)!}\cdot k^{v-k}\cdot\frac{1}{u^{v-1}}$$

To calculate the expected value, we multiply this by $k$ and sum it from $k=1$ to $v$ $$\frac{(u-1)!}{u^{v-1}}\cdot\sum_{k=1}^v\frac{k^{v-k+1}}{(u-k)!}$$

To find the expected value of $n$, we need to add $p$ to this sum. Rewriting this to replace $u$ and $v$ with the original variables, we get the following formula to estimate $n$

$$n\approx p+\frac{(|S|-p-1)!}{(|S|-p)^{|S|-m-1}}\cdot\sum_{k=1}^{|S|-m}\frac{k^{|S|-m-k+1}}{(|S|-p-k)!}$$

I’m not sure if this fully accounts for the higher probability of selecting elements from larger sets, but it should at least be more accurate than the method you describe. I’ll try to update this when I get a chance to work on it more.