Best vector distance measure for contrasting clustering?

145 Views Asked by At

Let say i have V-vectors, everyone with size N, which measure of distance should I use if i want to cluster them by the following criteria : the more elements vector has in common with other vectors of the cluster and the fewer elements in common with members of the contrasting clusters.

Ideally the number of clusters are dynamic and the vectors are added one-by one. But I don't think this have impact on the distance measure.


I found Tversky contrast distance ! It seems to be a Set distance metric, whatever that means. On the other hand the formula looks nice.

d(x,y) = a * f(x intersect y) - b * f(x-y) - c * f(y-x)

in binary :

d(x,y) = a * count(x&y) - b * count(x&~y) - c * count(~x&y)

1

There are 1 best solutions below

4
On

It's not clear if your "vector" is an unordered set of $N$ discrete objects, but that's what it seems like. You can simply define a distance metric e.g. like: $$ d(x,y) = \left[\gamma+ \sum_i \sum_j L[x_i,y_j] \right]^{-2} $$ where $\gamma\in\mathbb{R}^+$ and $$ L[x_i,y_j] = \begin{cases} 1 &\text{if} \;\;x_i = y_j\\ 0 & \text{otherwise} \end{cases} $$ Or if you view each $x$ as a set, you can use something like: $$ \tilde{d}(x,y) = \left( \gamma + \frac{1}{N}\sum_k {1}_{x_k\in y} \right)^{-1} $$

Then run pretty much any clustering algorithm using this metric.

As for the clustering being dynamic, you basically have to use a clustering algorithm that does not take the number of clusters as inputs. Algorithms like mean shift, affinity propagation, dbscan, or a Bayesian Gaussian mixture with a Dirichlet prior. You can get coordinates, if desired, by using the distance matrix $D$, where $D_{ij} = d(x_i,x_j)$, with an embedding algorithm like multidimensional scaling.


These two questions ([1] and [2]) are related. Some nice papers I saw that can do dynamic estimation of the number of clusters with online updates are:

  • Sequential clustering with particle filtering - Estimating the number of clusters from data by Schubert & Sidenbladh

  • Revisiting k-means: New Algorithms via Bayesian Nonparametrics by Kulis & Jordan

  • Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture by Campbell et al

  • Streaming Clustering with Bayesian Nonparametric Models by Huynh & Phung