Algorithm for partitioning a vector into "similar" subsets

28 Views Asked by At

I have to code an algorithm that optimally "partitions" a data vector into a number of $K$ subsets.

In particular, the input data is a vector $X = (x_1,x_2,...,x_N)$, where $x_n\in\{0,1\}$. A partition should group consecutive values of $x_n$ together if they are similar. For example, if the data set is $X=(0,0,0,1,1,1)$ and $K=2$, then the optimal partition creates the subsets $S_1 = (0,0,0)$ and $S_2 = (1,1,1)$.

The way I implemented this at the moment is that I define the following likelihood function: Denote by $p_k$ the proportion of $x_n=1$ among all $x_n$ in the $k$th subset of the data set. The likelihood function is equal to

$$ \Pi_{k=1}^K p_k^{\sum_{s_n\in S_k} s_n}(1-p_k)^{\sum_{s_n\in S_k} (1-s_n)} $$

For example, if $X = (0,0,0,1,1,0)$, and $K=2$, the maximum likelihood estimator is $S_1 = (0,0,0)$, $S_2=(1,1,0)$, $p_1=0$, $p_2=2/3$.

The way I calculate the maximum likelihood estimator at the moment is by brute force: For any possible partition of $X$, calculate the optimal $p_k$'s, calculate the likelihood value, and pick the one with the highest value. This is computationally demanding if $K$ and $N$ are large.

I have been thinking of alternative approaches to determine the optimal partition. One hypothesis I have at the moment is that the partitions of $X$ under a higher $K$ are always subsets of the partitions under lower $K$. In the example above, where $X=(0,0,0,1,1,0)$, the optimal partition under $K=3$, for example is $S_1=(0,0,0)$, $S_2=(1,1)$ and $S_3=(0)$. In other words, increasing $K$ by one will split up one subset of the optimal partition under $K$, leaving the rest unchanged.

If this were true, it would make the computations much more efficient. But I am not sure whether it is.

1

There are 1 best solutions below

1
On

You are correct. The optimal partitions under a higher K is actually the subset of the partitions under lower K. This property is called the nestedness property.