distance function for hierarchical clustering


I would like to implement hierarchical clustering for a dataset whose dimensions are on very different scales, e.g. meters vs. percentages vs. durations.

I want to adopt a distance method that can deal with this by standardizing the features. Do you have any suggestions?


Let $X=(x_1,\ldots,x_n)\in\mathbb{R}^{n\times m}$ be your dataset (with $x_j\in\mathbb{R}^m$). You want to transform it to $Z=(z_1,\ldots, z_n)$ before clustering. Here are a few options:

  1. Mean-variance normalization (z-scoring): $$ z_i = (x_i - \mu)\oslash\sigma $$ where $\mu = (1/n)\sum_j x_j$ is the feature-wise mean, $\oslash$ denotes elementwise (Hadamard) division, and $\sigma$ is the vector of feature-wise standard deviations.
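A minimal NumPy sketch of option 1 (the dataset below is made up for illustration, mixing meters, percentages, and seconds):

```python
import numpy as np

# Hypothetical dataset: 5 samples, 3 features on very different scales
# (meters, percentage, seconds).
X = np.array([
    [1200.0, 0.35,  90.0],
    [ 800.0, 0.50, 120.0],
    [1500.0, 0.20,  60.0],
    [1100.0, 0.45, 150.0],
    [ 950.0, 0.40, 100.0],
])

mu = X.mean(axis=0)      # feature-wise mean
sigma = X.std(axis=0)    # feature-wise standard deviation
Z = (X - mu) / sigma     # elementwise (Hadamard) division

# Every column of Z now has mean 0 and standard deviation 1.
```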

  2. Zero-one (or min-max) normalization: $$ z_i = (x_i - \alpha_\text{min}) \oslash (\alpha_\text{max} - \alpha_\text{min})$$ where $\alpha_\text{min}$ and $\alpha_\text{max}$ are the vectors of feature-wise minima and maxima over $X$.
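The same toy dataset under option 2, again as an illustrative NumPy sketch:

```python
import numpy as np

# Same kind of hypothetical mixed-scale dataset as above.
X = np.array([
    [1200.0, 0.35,  90.0],
    [ 800.0, 0.50, 120.0],
    [1500.0, 0.20,  60.0],
    [1100.0, 0.45, 150.0],
    [ 950.0, 0.40, 100.0],
])

a_min = X.min(axis=0)    # feature-wise minima
a_max = X.max(axis=0)    # feature-wise maxima
Z = (X - a_min) / (a_max - a_min)

# Every column of Z now lies in [0, 1].
```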

  3. PCA whitening $$ Z = [(X - M_\mu)W]_K $$ where $M_\mu$ is the matrix of mean vectors (each row is $\mu$), $W$ comes from the SVD of the centered data, $X - M_\mu = U\Lambda W^T$, and $[A]_K$ denotes taking the first $K$ columns of $A$ (in this case, removing the less important dimensions to do linear dimensionality reduction and potentially remove some noise). Note this formula only rotates and truncates; to whiten fully, also rescale each retained column by the inverse of its singular value.
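A sketch of option 3 with NumPy's SVD, on random stand-in data (note the SVD is taken of the centered matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # random stand-in dataset

Xc = X - X.mean(axis=0)                  # X - M_mu (center the data)
U, s, Wt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U @ diag(s) @ Wt

K = 2
Z = (Xc @ Wt.T)[:, :K]   # [(X - M_mu) W]_K: keep first K principal directions
# To whiten fully, rescale each column by its singular value:
# Z_white = Z / s[:K]
```

The retained columns of Z are mutually uncorrelated (the sample covariance of Z is diagonal), which is what makes dropping the trailing columns a clean way to discard low-variance directions.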

  4. Cholesky whitening (sets means to zero, removes all linear correlations, and normalizes each variance to 1): $$ Z = (X - M_\mu)L^{-T} $$ where $\Sigma_X = LL^T$ is the Cholesky factorization of the covariance matrix of $X$, so that the covariance of $Z$ is the identity.
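A NumPy sketch of option 4 on a synthetic correlated dataset (the mixing matrix is invented for illustration); after the transform, the sample covariance of Z is the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
# Build correlated 3-D data by mixing independent normals.
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ A

Xc = X - X.mean(axis=0)                  # X - M_mu
Sigma = np.cov(Xc, rowvar=False)         # covariance matrix of X
L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T
Z = Xc @ np.linalg.inv(L).T              # (X - M_mu) @ L^{-T}

# np.cov(Z, rowvar=False) is (numerically) the 3x3 identity.
```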

  5. Unsupervised manifold learning (nonlinear dimensionality reduction). I usually put my dataset through t-SNE before clustering or visualizing it, for instance (although I often try PCA first).
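A sketch of that last pipeline using scikit-learn's `TSNE` and SciPy's hierarchical clustering; the two-blob dataset and all parameter choices (perplexity, number of clusters, Ward linkage) are made up for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Two hypothetical well-separated blobs in 10 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 10)),
               rng.normal(5.0, 1.0, size=(20, 10))])

# Standardize first (option 1), then embed into 2-D with t-SNE.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(Z)

# Hierarchical (Ward) clustering on the embedding, cut into 2 clusters.
labels = fcluster(linkage(emb, method="ward"), t=2, criterion="maxclust")
```

Keep in mind that t-SNE distorts global distances, so treat the clustering on the embedding as exploratory rather than definitive.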