Distribution of number of clusters after randomly choosing a subset of points

157 Views Asked by At

Suppose we are given a set of points $S=\{x_i\}_{i=1,...,n}$ with $x_i\in\mathbb{R}$ and $x_i<x_{i+1}$. We define a cluster to be a chain of points in $S$ where the distance between consecutive points is less than one, i.e. $x_{i+1}-x_i < 1$. Suppose that $k$ points ($0\leq k < n$) have been removed at random, but that we do not know which points these were. What is the probability distribution $\mathrm{P}(m|k,S)$ of the number of clusters $m$ in the set of remaining points? Is there a way to compute this probability without simply trying every subset of $n-k$ points? All points are equally likely to be removed.


My progress so far:

Consider the pairwise distance matrix $\mathbf{D}$ with elements $D_{i,j}=x_i-x_j$. The number of clusters can be simply calculated as one plus the number of elements on the first diagonal that are greater than one. The act of removing point $x_j$ is equivalent to deleting row $j$ and column $j$ of $\mathbf{D}$. I have created an illustration of this abstraction here, where I show the pairwise distance matrix of twenty points with every cell coloured by whether it is less than (blue) or greater to one (red). The inset shows the twenty points, coloured by which cluster they belong to. By looking at the number of red cells on the first diagonal (and adding one), we can see that there are four clusters. I think this abstraction is useful, because we can immediately see that points $\{6,7,8,9\}$ are in an indivisable cluster, i.e. all the points are within a distance of one of each other. We can also see that the cluster $\{1,2,3,4\}$ would split in two if either points 2 or 3 were removed. Splitting a cluster can be re-expressed as removing row-and-column pairs such that a red cell moves onto the first diagonal, but note that cell $(i+n,i)$ can only end up on the first diagonal if points $i+1,...,i+n-1$ are removed. I'm now trying to use this puzzle to build intuition for the problem above.

This has been cross-posted on Reddit, where a discussion is underway and a reasonable solution found.