K Nearest Neighbors classification Special Case with Identical Points


The question is about KNN algorithm for classification - the class labels of training samples are discrete.

Suppose that the training set has n points that are identical to the new pattern we are about to classify, that is, the distances from these points to the new observation are zero (or < epsilon). It may happen that these identical training points have different class labels. Now suppose that n < K, so some other training points with non-zero distances to the new observation are also part of the nearest-neighbors collection. How do we assign a class label to the new point in this case?

There are a few possibilities, such as:

  1. consider all K (or more if there are ties with the worst nearest neighbor) neighbors and do majority voting
  2. ignore the neighbors with non-zero distances if there are "clones" of the new point in training data and take the majority vote only over the clones
  3. same as 2, but assign the class with the highest prior probability in the training data (among the clones)
  4. ...

Any ideas? (references would be appreciated as well)
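For concreteness, options 1 and 2 above can be sketched as follows. This is a toy helper, not taken from any library; the name `knn_label`, the `strategy` switch, and the `eps` threshold are my own choices for illustration:

```python
from collections import Counter

def knn_label(distances, labels, k, eps=1e-12, strategy=1):
    """Classify a point given its distances to all training samples.

    `strategy` follows the numbering in the question:
      1 -> majority vote over the k nearest neighbors
      2 -> if exact duplicates ("clones") exist, vote only over them
    """
    # Indices of training points, nearest first.
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    # "Clones": training points at (numerically) zero distance.
    clones = [i for i in order if distances[i] < eps]
    if strategy == 2 and clones:
        pool = clones          # option 2: vote over clones only
    else:
        pool = order[:k]       # option 1: plain k-NN majority vote
    votes = Counter(labels[i] for i in pool)
    return votes.most_common(1)[0][0]
```

With three clones labeled `a, a, b` and two farther neighbors labeled `b, b`, option 1 returns `b` (3 of 5 votes) while option 2 returns `a` (2 of 3 clones), which is exactly the disagreement the question is about.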


BEST ANSWER

This is not a mathematical problem, and it is far from having any single good answer. KNN is not a theorem; it is just a simple classification algorithm with no deep "rules".

Each of the proposed methods will work in some problems and fail in others. In general, there is no need to think about such border cases; simply use the default behaviour (option "1" from your question). In fact, if border cases of any classification algorithm become a problem, it is a signal of at least one of:

  • bad problem definition,
  • bad data representation,
  • bad data preprocessing,
  • bad model used.
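As a side note, a distance-weighted vote (a common k-NN variant, not specific to any library) resolves this border case without a special rule: each neighbor votes with weight 1/d, so any zero-distance clones receive effectively infinite weight and dominate the vote on their own. A minimal sketch, with my own naming and an assumed `eps` cutoff for "zero" distance:

```python
def weighted_vote(distances, labels, k, eps=1e-12):
    """Distance-weighted k-NN vote: each of the k nearest neighbors
    votes with weight 1/d. Neighbors at (near-)zero distance would
    have infinite weight, so if any exist, only they vote - which
    reproduces option 2 from the question automatically."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    clones = [i for i in order if distances[i] < eps]
    voters = clones if clones else order
    scores = {}
    for i in voters:
        # Clones each get equal (unit) weight; otherwise inverse distance.
        w = 1.0 if clones else 1.0 / distances[i]
        scores[labels[i]] = scores.get(labels[i], 0.0) + w
    return max(scores, key=scores.get)
```

When no clones are present this is ordinary inverse-distance weighting, so the same function covers both the border case and the regular case.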