intra cluster sampling


Apologies for the longish context-setting, but we are working on a recommender system which needs to let users know which erroneous files need attention. These erroneous files have been generated via another pipeline (not relevant to this discussion). The idea is to NOT expose the user to all the hundreds or thousands of erroneous files, but to pick samples that are representative of the population. Here's what I have coded / thought of so far.

The first step is generating a feature set and using KMeans clustering, monitoring intra- and inter-cluster distance to ensure tight-fitting and well-spread-out clusters. Within each cluster, I then calculate the distance of every point from the centroid. I am seeing lots of patterns here (e.g., multiple files at almost the same distance from the centroid) but am unable to formulate them in a mathematical sense. Can I simply create a histogram based on the distances and then randomly sample files from within each bucket? I initially tried making a normality assumption and calculated the mean and standard deviation to sample files from within +/- x standard deviations, but the data is very obviously of unknown distribution, so treating it as a histogram seems to make sense to me. Can someone here please point me to better cluster-sampling techniques?
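For concreteness, here is a minimal sketch of the histogram-bucket sampling idea described above. The feature matrix, cluster labels, and centroids are mocked with random data here (in practice the labels and centroids would come from the KMeans fit); the numbers of clusters and bins are illustrative choices, not fixed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))               # stand-in for the real feature matrix
# stand-in for KMeans output: a label per file and a centroid per cluster
n_clusters = 5
labels = rng.integers(0, n_clusters, size=len(X))
centers = np.array([X[labels == c].mean(axis=0) for c in range(n_clusters)])

samples = []                                 # representative files (row indices)
for c in range(n_clusters):
    idx = np.flatnonzero(labels == c)
    # distance of every point in the cluster to its centroid
    d = np.linalg.norm(X[idx] - centers[c], axis=1)
    # bucket the distances directly -- no distributional assumption needed
    edges = np.histogram_bin_edges(d, bins=10)
    buckets = np.digitize(d, edges[1:-1])    # bucket index 0..9 per point
    for b in np.unique(buckets):
        members = idx[buckets == b]
        samples.append(int(rng.choice(members)))  # one random file per bucket
```

Each cluster contributes at most one file per non-empty distance bucket, so the user sees a small set that spans near-centroid (typical) files as well as far-from-centroid (borderline) ones.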


So here's what I tried (and it solved the problem statement, at least for my use case):

  • Used the feature matrix for every document (an n x m matrix). Since it is rectangular, I used SVD to find the biggest singular value for every document.
  • I then took the corresponding RIGHT singular vector (the right one, since the left one would vary with the length of the document, while the "m" dimension stays common because it is a fixed-size embedding).
  • I compared these vectors across the cluster using cosine similarity, with a super-high similarity threshold of 0.95 (or cosine distance <= 0.05), and was able to form sub-clusters with relative ease.

It's early days, since my testing team is going to try to break this, but I thought I could share it.
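The steps above can be sketched as follows. The per-document matrices are random stand-ins (each document is an n_i x m matrix with n_i varying and m fixed), the grouping is a simple greedy pass rather than whatever the author's exact sub-clustering code does, and the comparison uses the absolute cosine because the sign of a singular vector is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 32                                   # fixed embedding dimension
# stand-in documents: one (n_i x m) matrix each, n_i varying per document
docs = [rng.normal(size=(int(rng.integers(20, 80)), m)) for _ in range(12)]

# top RIGHT singular vector of each document matrix: always length m,
# regardless of document length, so the vectors are directly comparable
reps = []
for d in docs:
    _, _, vt = np.linalg.svd(d, full_matrices=False)
    reps.append(vt[0])
reps = np.array(reps)
reps /= np.linalg.norm(reps, axis=1, keepdims=True)

# greedy sub-clustering: join a group if cosine similarity >= 0.95
# (abs() because a singular vector is only defined up to sign)
subclusters = []                         # list of lists of document indices
for i in range(len(reps)):
    for grp in subclusters:
        if abs(reps[i] @ reps[grp[0]]) >= 0.95:
            grp.append(i)
            break
    else:
        subclusters.append([i])
```

With real near-duplicate erroneous files the dominant right singular vectors line up and the groups merge; with the random matrices used here, most documents end up in singleton sub-clusters.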