Currently I'm struggling with a field that is new to me, namely clustering. I would really appreciate any help I could get!
The starting situation is that a data set $(x_k)_{k\in\{1,\dots,n\}} \subseteq \mathbb{R}^N$ is given. The task is to partition this set into clusters $C_1,\dots, C_m$ (where $m$ is not preset) such that, for a given $c \in \mathbb{R}_{>0}$, $$ \forall i \in \{1,\dots,m\} \ \forall x,y \in C_i \colon \ \Vert x-y \Vert \leq c \\ \forall i,j \in \{1,\dots,m\} \ \forall x \in C_i \ \forall y \in C_j \colon \ i \neq j \ \Longrightarrow \ \Vert x - y \Vert > c $$ and such that $m$ is minimal. In other words, what I'm looking for is: How can I divide the initial data set into as few clusters as possible so that the elements within each cluster are at most distance $c$ apart, while elements of distinct clusters are more than distance $c$ apart? (One could perhaps also ask this question with distances replaced by similarities.)
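To make the constraints concrete: a cluster may contain no pair farther apart than $c$, and no edge of length $\leq c$ may cross cluster boundaries. So every valid partition consists of connected components of the graph linking points at distance $\leq c$, and a valid partition exists exactly when each such component is also a clique (in which case it is unique, hence minimal). Here is a small Python sketch of a checker along these lines, assuming Euclidean distance; the function names are my own:

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def threshold_components(points, c):
    """Connected components of the graph joining points at distance <= c,
    computed with a simple union-find."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist(points[i], points[j]) <= c:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[ri] = rj

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

def valid_partition(points, c):
    """Return the clusters (as index lists) if a partition satisfying both
    conditions exists, otherwise None."""
    comps = threshold_components(points, c)
    for comp in comps:
        # each component must also be a clique: all internal pairs within c
        if any(dist(points[i], points[j]) > c for i, j in combinations(comp, 2)):
            return None
    return comps
```

For example, four points in two tight groups yield two clusters, while a chain $0, 1, 2$ with $c = 1$ has one connected component that is not a clique, so no valid partition exists.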
Does anybody know some keywords I could look for? It would be great if there already was an algorithm or easy implementation for that. I'm also happy if somebody knows something that solves a problem which is close to mine.
Also, is there a method that would allow replacing "$\Vert x-y\Vert$" with an arbitrary distance measure $d(x,y)$ and that relies only on the distances between the given points, not on distances involving new points? I ask because some of my ideas use custom distances (or similarities) for which it would be too expensive to compute the distance from a new point (for example, the mean of some of the given points) to another point.
Regards Murp
It seems that you are interested in partitional clustering; I would start by having a look at the k-means algorithm. The distance used by the algorithm is Euclidean, and the number of clusters $m$ is an input given at the very beginning by the user. Typically one runs k-means several times with different numbers of clusters and assesses the quality of each partition with ad hoc methods. If you want to use more general distances on your discrete dataset, then I would suggest having a look at PAM (Partitioning Around Medoids): the R implementation of PAM allows the user to give as input a dissimilarity matrix produced with any user-defined distance.
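Since you only have pairwise distances, a medoid-based method fits well: medoids are existing data points, so no distances to new points (like cluster means) are ever needed. As a rough illustration, here is a naive alternating k-medoids scheme in Python operating purely on a precomputed distance matrix; this is a simplification of PAM's build/swap procedure, and the names are my own:

```python
import random

def k_medoids(D, k, iters=100, seed=0):
    """Naive k-medoids on a precomputed distance matrix D (list of lists).

    Alternates between assigning each point to its nearest medoid and
    moving each medoid to the cluster member minimising the total
    within-cluster distance. Returns a dict: medoid -> member indices.
    """
    rng = random.Random(seed)
    n = len(D)
    medoids = rng.sample(range(n), k)
    clusters = {}
    for _ in range(iters):
        # assignment step: nearest medoid for every point
        clusters = {m: [] for m in medoids}
        for p in range(n):
            nearest = min(medoids, key=lambda m: D[p][m])
            clusters[nearest].append(p)
        # update step: best medoid inside each cluster
        new_medoids = [
            min(members, key=lambda q: sum(D[q][r] for r in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return clusters
```

For real use I would still reach for a maintained implementation (e.g. `pam` in R's `cluster` package, which accepts dissimilarity matrices directly), but the sketch shows why only the given pairwise distances are needed.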
Density-based clustering algorithms are interesting too: DBSCAN is a rather famous one.
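To give a feel for the density-based idea: DBSCAN grows clusters from "core" points that have at least `min_pts` neighbours within radius `eps`, and marks unreachable points as noise. A bare-bones Python sketch (Euclidean distance, no spatial index; an illustration, not the reference implementation):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 = noise.
    min_pts counts the point itself among its neighbours."""
    n = len(points)
    labels = [None] * n

    def neighbours(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        # i is a core point: start a new cluster and expand it
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is core too: keep expanding
    return labels
```

Note that DBSCAN does not enforce hard constraints like those in the question; it finds dense regions, which is a related but different goal.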
On software: I believe R has all you need to explore the possibilities of partitional/hierarchical clustering. Additional packages can provide you with more sophisticated and recent algorithms.