Background: Let S be a set of scores. These scores can be any positive real number. The size of S is large. A specific score s is subject to change frequently.
I would like to tell each user (a user owns one and only one score) how their score s compares to all the other scores in S, preferably on a scale of 0-100. The first thing that came to mind was to tell the user what percentile their score was in the distribution.
My question is one regarding performance: It seems that to find what percentile a score s belongs to, S must be first sorted. Since S might be very large, and each score s is subject to frequent change, it seems that maintaining a sorted S might become very expensive to maintain. Is there a workaround to this problem? I considered that instead of keeping a continuously changing S, perhaps I take a snapshot of S and sort it. I would take one of these snapshots every couple of hours, calculate the percentile using that snapshot and telling the user this “estimated” percentile.
Obviously estimated isn’t perfect but if performance becomes an issue maybe its good enough. Is this a good solution? Can I do anything to improve on this solution? Is there a better method that is more performant/accurate?
How do you get notified of score changes? Once you have sorted $S$ you can have the list in score order and an index list that shows where the percentiles (thousandths or whatever fraction you want) break. Then if you are told that score $x$ changed to $y$ you can update the lists by shifting everything between $x$ and $y$ by one slot and adjusting the percentiles as required. This works very efficiently if score changes are small compared to the range, so you don't have to move many entries per score change.
If the score changes are large it will become more efficient to collect them for a while and sort the list anew. If there are $N$ scores, a sort takes about $N \log N$ comparisons. If the score changes have no correlation between the start and finish the expected change is $\frac N3$, so after a few times $\log N$ it is more efficient to resort than to maintain the list. The good news is that if you have a few times $\log N$ changes, nobody's ranking (except those that change) can change very much. It would be reasonable to leave those who do not change at their old percentile. You can even guarantee it is accurate to within a certain amount, calculated by assuming every change took somebody past them in the same direction. Since the list is long, it will take many changes to change a percentile ranking. For those that change, see where the new score puts them in the percentile rank. Then resort when there are enough changes.