Apologies for the longish context-setting, but we are working on a recommender system that needs to let users know which erroneous files need attention. These erroneous files are generated by another pipeline (not relevant to this discussion). The idea is NOT to expose the user to all of the hundreds or thousands of erroneous files, but to pick samples that are representative of the population. Here's what I have coded / thought of so far.
The first step is generating a feature set, running KMeans clustering, and monitoring intra- and inter-cluster distances to ensure tight, well-separated clusters. Within each cluster, I then calculate the distance of every point from the centroid. I am seeing lots of patterns here (e.g. multiple files sitting at almost the same distance from the centroid) but am unable to formulate them in a mathematical sense. Can I simply build a histogram of the distances and then randomly sample files from within each bucket? I initially made a normality assumption, calculated the mean and SD, and tried sampling files from within +/- x standard deviations, but the data is very obviously of unknown distribution, and treating it as a histogram seems to make more sense to me. Can someone here please point me to better cluster sampling techniques?
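For concreteness, here is a minimal sketch of the histogram idea for a single cluster, assuming numpy; `points` and `centroid` are hypothetical stand-ins for one cluster's feature vectors and its KMeans centroid, and `n_bins` / `k` (files per bucket) are parameters I made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: feature vectors for one cluster and its centroid
points = rng.normal(size=(500, 8))
centroid = points.mean(axis=0)

# Distance of every point from the centroid
dists = np.linalg.norm(points - centroid, axis=1)

# Bucket the distances into a histogram -- no distributional assumption
n_bins = 10
edges = np.histogram_bin_edges(dists, bins=n_bins)
# digitize against the inner edges so bin ids run 0 .. n_bins-1
bin_ids = np.digitize(dists, edges[1:-1])

# Sample up to k files uniformly at random from each non-empty bucket
k = 2
sample_idx = []
for b in range(n_bins):
    members = np.flatnonzero(bin_ids == b)
    if members.size:
        sample_idx.extend(
            rng.choice(members, size=min(k, members.size), replace=False)
        )
sample_idx = np.array(sample_idx)
```

This guarantees the sample covers the whole range of distances, including the sparse outer buckets, instead of concentrating where the (unknown) density happens to peak.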
So here's what I tried (and it solved the problem statement, at least for my use case):