Distance metric that is insensitive to correlated variables

I'm trying to find a suitable pairwise distance metric where the addition of correlated vectors results in (essentially) no change in the distance.

Specifically, consider a set of $k$ vectors, each of length $n$, arranged into an $n \times k$ matrix $T_0 = [t_1, \dots, t_k]$. We could calculate the standard Euclidean distance on this matrix, for example, arriving at an $n \times n$ distance matrix $D = dist(T_0)$, where $D_{i,j} = \sqrt{\sum_{m=1}^k{(t_{im} - t_{jm})^2}}$, i.e., the pairwise Euclidean distance between rows $i$ and $j$ of $T_0$, for $i, j \in \{1, \dots, n\}$. Now suppose we had another $n$-dimensional vector $t_{k+1}$ that is nearly identical to some column already present in $T_0$ (i.e., $cor(t_s, t_{k+1}) > 0.99$ for some $s \in \{1, \dots, k\}$). We can append this to $T_0$ and call the new matrix $T_1$, with dimension $n \times (k+1)$. What I'm looking for is a distance metric for which $dist(T_0) \approx dist(T_1)$.
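To make the setup concrete, here is a minimal NumPy sketch of the problem (not a solution). The random data, the seed, the sizes `n = 50` and `k = 4`, the noise level `0.01`, and the helper name `pairwise_euclidean` are all illustrative assumptions; only the construction of $T_0$, $t_{k+1}$ and $T_1$ follows the description above.

```python
import numpy as np

def pairwise_euclidean(T):
    """Return the n x n matrix of Euclidean distances between the rows of T."""
    diff = T[:, None, :] - T[None, :, :]      # shape (n, n, number of columns)
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)                # arbitrary seed, for reproducibility only
n, k = 50, 4                                  # illustrative sizes
T0 = rng.normal(size=(n, k))                  # synthetic stand-in for T_0

# t_{k+1}: a near-copy of the first column (s = 1 in the notation above)
t_new = T0[:, 0] + 0.01 * rng.normal(size=n)
T1 = np.column_stack([T0, t_new])             # n x (k + 1)

D0 = pairwise_euclidean(T0)
D1 = pairwise_euclidean(T1)

corr = np.corrcoef(T0[:, 0], t_new)[0, 1]     # should be well above 0.99
max_change = np.abs(D1 - D0).max()            # largest change in any pairwise distance
print(f"cor = {corr:.5f}, max |D1 - D0| = {max_change:.3f}")
```

Because the extra column adds a non-negative term to every squared distance, $D$ can only grow under the plain Euclidean distance; `max_change` shows by how much for this synthetic data.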

Intuitively, I would have thought that either the Mahalanobis distance or the Euclidean distance in PCA space would take care of this by accounting for the covariance structure of the vectors. But from a bit of experimentation this doesn't seem to be the case. The closest thing I can come up with is some sort of weighted distance in PCA space, where the weights are given by the eigenvalues. That is, conduct a PCA on $T_1$, resulting in unit eigenvectors $e_1, \dots, e_{k+1}$. Then multiply each eigenvector by its eigenvalue $\lambda_1, \dots, \lambda_{k+1}$. Then calculate the pairwise distance on the vectors $\{\lambda_1 e_1, \dots, \lambda_{k+1} e_{k+1}\}$.
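For anyone who wants to experiment with this idea, here is a rough sketch of one possible reading of it. It assumes that "multiplying each eigenvector by its eigenvalue" means projecting the centered rows onto $\lambda_j e_j$, i.e. scaling the $j$-th PCA score by $\lambda_j$. The function name `eigen_weighted_pca_distance`, the use of the sample covariance matrix, and the synthetic data are my own assumptions, and nothing here shows that the resulting distance has the desired property.

```python
import numpy as np

def eigen_weighted_pca_distance(T):
    """Pairwise distances between rows of T in eigenvalue-weighted PCA space.

    Assumption: component j is projected onto lambda_j * e_j, so each
    PCA score column is multiplied by its eigenvalue.
    Returns the n x n distance matrix and the eigenvalues (ascending).
    """
    Tc = T - T.mean(axis=0)                   # center each column
    cov = np.cov(Tc, rowvar=False)            # sample covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues, unit eigenvectors as columns
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative round-off
    weighted = (Tc @ eigvecs) * eigvals       # scores scaled by eigenvalue
    diff = weighted[:, None, :] - weighted[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)), eigvals

# Synthetic data (illustrative only): T1 is T0 plus a near-copy of its first column.
rng = np.random.default_rng(0)
n, k = 50, 4
T0 = rng.normal(size=(n, k))
T1 = np.column_stack([T0, T0[:, 0] + 0.01 * rng.normal(size=n)])

D0w, eig0 = eigen_weighted_pca_distance(T0)
D1w, eig1 = eigen_weighted_pca_distance(T1)

# Relative Frobenius-norm difference between the two distance matrices
rel_change = np.linalg.norm(D1w - D0w) / np.linalg.norm(D0w)
print("eigenvalues of T1:", np.round(eig1, 5))
print("relative change in distances:", round(float(rel_change), 3))
```

The smallest eigenvalue of `T1` is close to zero here, so the direction it belongs to is almost switched off. `rel_change` then reports how far the result still is from the `T0` version, which is the quantity the question is about.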

Intuitively, this should essentially erase the contribution of the $(k+1)$-th vector: it accounts for almost no additional variation, so we're essentially multiplying it by zero. But is there a canonical approach to this? Am I reinventing the wheel, or conversely, completely missing something and making up a terrible metric when a better one already exists? I'm certain this is a common problem.