Classification with equal number of elements per class

155 Views Asked by At

I am looking for a way to classify a set of points $\vec x$ such that each class contains roughly the same number of points. Currently I start with the first point $\vec x_0$ use this as reference point for the first class and the next point get eithers assigned to that class, if $|\vec x - \vec x_0| <eps$ or I take it as the reference point for a new class. Similar all the other points are assigned to a class, when the distance to the reference point is minimal or I take them as a new reference point, if the smallest distance is bigger than the threshold $eps$. However, in this way I get few classes with lots of points and many classes with few points. How can I classify the points such that I have roughly the same number of points per class?

I am pretty sure that there exists some algorithm, but I dont really know what terms to search for.

PS: wasnt sure how to tag the question, didnt find any better tag...

PPS: Some clarification...

  • the "points" are vectors with ~50 components.
  • the classification should be such that points within each class have minimum distance from each other
2

There are 2 best solutions below

0
On

For what it's worth:

Start with a balanced assignment of the points to the classes, for instance by sorting on the first component.

(Or trying all components in turn and keeping the one that leads to the best compactness. Another option is to start with a K-NN assignment - which won't necessarily be balanced -, keep the class centers and assign them the nearest neighbors in a greedy way.)

Then iteratively improve by

  • finding the class member which is the farthest to its class center,
  • trying to swap it with all members of the other classes and seeing if there is a global distance reduction,
  • keeping the most beneficial swap.

(Or even simpler, rate all possible swaps in terms of distance reduction and apply the best of them.)

This will be costly and may lead to dead-ends, but hopefully better than nothing.

0
On

Naive idea based on this comment:

@YvesDaoust I am not sure if "cluster" is the right term, because the points are rather homogeneously distributed (ie. there are no clusters) but the density of points drops towards the borders – tobi303 10 mins ago

Perhaps try rectangular parallelepipeds that are widest near the center of the region with widths that decrease as you move away from the center. This should be easy to program. Experiment to get the right rate of decrease. It may not be the same for all the dimensions.