Hierarchical Clustering with Ward Distance

74 Views Asked by At

I know how hierarchical clustering (with a certain definition of inter-cluster distance) works. And I know that Ward's procedure is based on the goal of minimizing the sum of the squared errors associated with each aggregation step. If I didn't misunderstand, this can be brought under the umbrella of conventional aggregation methods by defining the ward distance ($c_S$ is the centroid/mean of $S$): $$d(A,B) = \frac{|A||B|}{|A|+|B|}∥c_A-c_B∥^2 . $$

The fact that hierarchical clustering with this distance is equivalent Ward's method is a consequence of the following identity:

$$\sum_{x\in A\cup B}∥x-c_{A\cup B}∥^2 - \sum_{x\in A}∥x-c_A∥^2 - \sum_{x\in B}∥x-c_B∥^2 = \frac{|A||B|}{|A|+|B|}∥c_A-c_B∥^2 . $$

Can someone explain how this last formula is derived?

P.S. That Ward's distance is not a "real" distance (in the sense of metric function), am I right?