How to cluster texts by most relevant words


Say I have a collection of texts (books, documents, and so on). Each document has a "portrait": an array of (word, weight) pairs.

For example, a book about Algorithms and Data Structures can have this portrait:

sorting - 0.025
merge - 0.0003
bubble - 0.0001

The idea is that a portrait captures a document's main words, the ones most relevant to that particular text. On average, a portrait contains thousands of pairs.
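
To make the data shape concrete, here is a minimal sketch of one way a portrait could be stored, assuming a plain Python dict keyed by word. The names `algorithms_book` and `top_words` are made up for illustration; the weights are the three from the example above.

```python
# A portrait stored as a dict: word -> weight.
# (Hypothetical variable name; weights copied from the example above.)
algorithms_book = {"sorting": 0.025, "merge": 0.0003, "bubble": 0.0001}


def top_words(portrait, n):
    """Return the n highest-weighted (word, weight) pairs, heaviest first."""
    return sorted(portrait.items(), key=lambda pair: pair[1], reverse=True)[:n]


print(top_words(algorithms_book, 2))
# [('sorting', 0.025), ('merge', 0.0003)]
```

A dict keeps the representation sparse, which matters when each portrait has thousands of pairs but the overall vocabulary is much larger.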

Now for the main part: I want to cluster all of these documents into groups. The general idea is that if two books share many words and the weights of those words are roughly equal, the books probably belong to the same group.
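
One possible way to turn "many shared words with similar weights" into a number is cosine similarity between two portraits treated as sparse vectors. This is only a sketch of that assumption, not necessarily the right measure: cosine compares the direction of the weight vectors, so two portraits whose weights differ by a constant factor still score 1. The function name `cosine_similarity` and the sample portraits are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two portraits given as word -> weight dicts."""
    # Only words present in both portraits contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty portrait is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical portraits for illustration.
book_a = {"sorting": 0.025, "merge": 0.0003, "bubble": 0.0001}
book_b = {"recipe": 0.03, "oven": 0.01}

print(cosine_similarity(book_a, book_b))  # 0.0, no words in common
```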

The problem is that I don't know where to start. I've read about the k-means algorithm, and it seems pretty close to what I have in mind. Can you point me in the right direction, or do you know of better algorithms for clustering documents?
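
For concreteness, here is a rough sketch of what k-means over such portraits might look like, using only the standard library. It makes several assumptions that are not part of the question: portraits are word -> weight dicts, vectors are length-normalized so that the dot product equals cosine similarity (a variant often called spherical k-means), and the starting centroids are picked with a simple farthest-first rule instead of at random. All function names and sample portraits are illustrative.

```python
import math


def normalize(vec):
    """Scale a sparse vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {word: w / norm for word, w in vec.items()} if norm else dict(vec)


def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(w * b.get(word, 0.0) for word, w in a.items())


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for word, w in vec.items():
            total[word] = total.get(word, 0.0) + w
    return {word: w / len(vectors) for word, w in total.items()}


def kmeans(portraits, k, iterations=20):
    """Return one cluster label (0..k-1) per portrait."""
    docs = [normalize(p) for p in portraits]
    # Farthest-first start: begin with the first document, then repeatedly
    # add the document least similar to the centroids chosen so far.
    centroids = [docs[0]]
    while len(centroids) < k:
        centroids.append(min(docs, key=lambda d: max(dot(d, c) for c in centroids)))
    labels = [None] * len(docs)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: dot(doc, centroids[c])) for doc in docs]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [doc for doc, label in zip(docs, labels) if label == c]
            if members:
                centroids[c] = normalize(mean(members))
    return labels


# Hypothetical portraits: two algorithm books and two cookbooks.
books = [
    {"sorting": 0.025, "merge": 0.0003},
    {"sorting": 0.02, "bubble": 0.0001},
    {"recipe": 0.03, "oven": 0.01},
    {"recipe": 0.02, "flour": 0.005},
]
print(kmeans(books, 2))  # [0, 0, 1, 1]
```

Note that k-means needs the number of groups `k` up front, which may or may not fit the problem.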