Distance between two partitions of a set

1.2k Views Asked by At

I want to compare two different partitions of a set consisting of $n$ elements: ${1,2,3,...,n}$. How can I count the number of elements that are in different subsets when comparing partition 1 with partition 2?

Examples with $n=6$

Partition 1: $\{\{1,2,3\},\{4,5,6\}\}$, partition 2: $\{\{1,2,4\},\{3,5,6\}\}$ should yield $2$ since the elements $3$ and $4$ are in different subset.

Partition 1: $\{\{1,2,3\},\{4,5,6\}\}$, partition 2: $\{\{1,2\},\{3\},\{4,5,6\}\}$ should yield $1$ since the element $3$ is in a different subset.

Partition 1: $\{\{1,2,3\},\{4,5,6\}\}$, partition 2: $\{\{1,2,3,4,5,6\}\}$ should yield $3$ since the elements $4,5,6$ are in different subsets.

Background: I want to compare two different community partitions of a network. I would like to know how many nodes I have to "move" to a different community to get from one partition to the other. In this picture you can see a possible community partition (the communities are color-coded, i.e. nodes with the same color belong to the same community).

2

There are 2 best solutions below

0
On BEST ANSWER
1
On

Try this:

  • Call the partition with the largest number of parts $P_1$, the first partition (if it's a tie, pick whichever)

  • Arbitrarily label the parts in $P_1$.

  • Starting with a largest part $p$ of the second partition $P_2$, find a part $p'$ of $P_1$ so that $p \cap p'$ is as large as possible (if there is a tie, break it however). Label $p$ with the same label as $p'$

  • Repeat the preceding step until all of the parts in the second partition are labeled, never reusing labels within $P_2$.

This won't in general be unique.

For example: $$ \text{Partition 1: } \{\underbrace{\{1,2\}}_{A},\underbrace{\{3\}}_{B},\underbrace{\{4,5,6\}}_{C}\} \qquad \text{Partition 2: } \{\underbrace{\{1,2,3\}}_{A},\underbrace{\{4,5,6\}}_{C}\} $$

Then, you can think of each partition as a bunch of orderd pairs:

$P_1 = (1, A), (2, A), (3, B), (4, C), (5, C), (6, C)$ for the first, and

$P_2 = (1, A), (2, A), (3, A), (4, C), (5, C), (6, C)$ for the second.

Viewing these partitions as the set of ordered pairs, then the $1$ just so happens to be $|P| - |P_1 \setminus P_2| = n - |P_1 \setminus P_2|$; $n$ minus the number of things belonging to sets with the same label. In general, I believe $n - |P_1 \setminus P_2|$ should count the number of differences, as you've described it. I've tried to think of a pair of partitions for which this technique won't work, but haven't been able to.

Disclaimers:

  1. I'm not sure this works, although I suspect it does. Hopefully since you have a vested interest in the problem, you can try to implement this, and see if it works as it should -- it does in all of the test cases, and one additional test case I tried: $1\ 2\ 3\ 4 \mid 5\ 6\ 7\ 8 \longrightarrow 1\ 2 \mid 3\ 4 \mid 5\ 6 \mid 7\ 8$, which (I think) is $6$ differences.
  2. The part I'm least confident with is the extent to which the labeling of parts in the second partition affects things. I said "largest first" because you want to match labels for remaining parts that are the most similar. It's possible that this "greedy approach" to labeling $P_2$ will cause something to go awry, but I don't think so.