What is a useful distance metric (Euclidean?) when the number of attributes being compared differs per instance?


Imagine that I have a data set with 15 examples, each with 100 attributes. I want to use these 15 examples to classify unseen examples based on the similarity/distance between them. The classification uses the class of the nearest neighbor.

Specific to my case is that the attributes carrying relevant information (as opposed to noise) differ per example. For instance, for examples 1 to 5, the meaningful information is spread over 90 features (from 0 to 90). For examples 6 to 10, only features 0 to 5 are relevant (the other 95 are meaningless/contain noise when comparing similarity to these instances). Similarly, examples 11 to 15 are best represented by features 6 to 10 alone, because the others only add noise when detecting these classes. I assume that comparing only the meaningful features should result in better classification performance.

I want to exploit this knowledge about which features are meaningful for each instance. Therefore, I decided to use a weighted distance (currently Euclidean (p=2) or Manhattan (p=1)) with a weight vector consisting of zeros and ones. I also normalize each weight as 1/sum(weights), so the weights used in the similarity calculation sum to 1 while the other entries remain zero. During retrieval, I only know the relevant attributes for my 15 training examples, so I restrict each comparison to the attributes known to be relevant for the stored instance.
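To make the setup concrete, here is a minimal sketch of the weighted Minkowski distance I mean; the function name and the per-instance index list `relevant_idx` are my own illustration, not an existing API:

```python
import numpy as np

def weighted_minkowski(x, y, relevant_idx, p=2):
    """Distance between query x and stored instance y, restricted to
    the attributes known to be relevant for y.

    x, y         : 1-D feature vectors of equal length
    relevant_idx : indices of the attributes relevant for y
                   (assumed known per training instance)
    p            : 1 for Manhattan, 2 for Euclidean
    """
    w = np.zeros(len(y))
    w[relevant_idx] = 1.0
    w /= w.sum()  # weights over the relevant features sum to 1
    return float((w * np.abs(x - y) ** p).sum() ** (1.0 / p))
```

With normalized 0/1 weights, this is the p-th root of the *mean* of |x_i - y_i|^p over the relevant features, so distances stay on a comparable scale regardless of how many features enter the comparison.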

I found some sources online that state that for sparse high-dimensional data (which is my case: I currently have 20k features, previously selected through a significance test) one should use cosine distance. But the problem is the same for weighted cosine distance. See, for example, the paragraph "High Dimensionality".

The problem: I mostly get the highest similarity / smallest distance for instances where only a small number of features is compared. My assumption is that the probability of obtaining a small distance is higher when comparing few features than when comparing many (even though I average the weights over the number of features used).

  • Am I using the wrong metric (is Minkowski distance unsuitable here)?
  • Am I applying weights to the metric in the wrong way?