Simple Algorithm for Finding Clusters in Matrix

220 Views Asked by At

Let's say I have a simple 4x4 matrix which describes a set of clusters. If $A_{ij}=0$ then node $i$ and node $j$ are in the same cluster. For example: $$ A = \left(\begin{array}{cccc} 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 0 \end{array}\right) $$

would imply that nodes $1$, $2$, and $3$ are in a cluster together and node $4$ is in a cluster by itself.

Now lets say there is a slight mistake in the matrix where:

$$ A = \left(\begin{array}{cccc} 0 & 0 & 1 & 1\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 0 \end{array}\right) $$

Now we have that node $1$ and $2$ should be in the same cluster, node $2$ and $3$ should be in the same cluster, but not node $1$ and $3$.

In the first case, I would just compare equivalency of the rows, but in the latter, now I need something a bit smarter. Any simple solutions in MatLab to make sure that nodes $1$, $2$, and $3$ all make it in the same cluster?

1

There are 1 best solutions below

0
On

The zero entries of given matrix $A$ form (we assume) a symmetric relation, but not necessarily reflexive or transitive one. Those three properties are required of an equivalence relation, or what amounts to the same thing, a partition of the nodes (a family of disjoint subsets whose union is all of those nodes, which you called *clusters^).

A natural approach would be to set additional nodes of $A$ to zero in order to achieve the reflexive and transitive properties without losing the symmetric property. The reflexive property, that every individual node is related to itself, is easily imposed. Set all the diagonal entries of $A$ to zero. (Clearly this does not change the symmetric character of $A$, and we shall assume hereafter that it has already been done.)

That leaves us with just the problem of imposing the transitive property. This can usually be done in more than one way, such as the trivial approach of setting all entries of $A$ to zero. The (unique) solution which sets the fewest additional entries of $A$ is called the transitive closure, and this may well be the "something a bit smarter" that you want.

Computationally an easy way to do this is by matrix multiplication, which accords well with your desire to do a Matlab implementation. First take the complement of each entry in $A$, mapping one entries to zeros and zero entries to ones. We will call this matrix $B$.

Now we perform repeated multiplications of $B$, together with setting any nonzero entries that appear to one. That is one matrix multiplication per step, followed by truncating the entries:

$$ B_1 = B\, ; B_{k+1} = f(B \times B_k) $$

where $f$ is the operation of setting nonzero entries to $1$.

$B_k$ is the identity matrix plus the adjacency matrix of a simple graph, namely the graph whose edges connect a pair of distinct nodes which can be reached in $k$ or fewer steps using the original relation between nodes.

The sequence of matrices $B_k$ eventually stabilizes, and actually $B_n = B_{n-1}$ where $n$ is the matrix size. So we check when computing the next matrix to see if any additional nonzero entries have been introduced, and we stop whenever there are no new $1$ entries.

It will likely be expeditious to take steps by squaring $B_{2k} = f(B_k^2)$ the matrices. The termination criterion then becomes $B_{2k} = B_k$. This still requires only one matrix multiplication per step, and typically only logarithmically as many steps. So the computational cost is reduced.

Finally one complements the converged matrix $B_k$ by turning $1$ entries to zeros and $0$ entries to ones, and that gives us the "corrected" matrix $A$ you want.