Algorithm for measuring discrepancy between counts of different magnitudes and mapping it to a color gradient


Suppose we have a form of the double-entry bookkeeping principle applied to a website: one party counts ad views, and the other party independently counts the same ad views. Due to latency or fraud, one party may count far more (or far fewer) views than the other, which leads to discrepancies.

I am trying to figure out an appropriate way of measuring the magnitude of the discrepancy and mapping it onto a gradient that goes from green to yellow to red. The problem is that I can't rely solely on percentage difference, because it ignores whether the absolute counts are large enough to be significant.

If one party counts 2 views and the other counts 1, that's a 50% discrepancy, which sounds huge, but 2 views is not a significant number, so the gradient here would be green. The same goes for 2 vs 0, 10 vs 0, 100 vs 0, 500 vs 0, and so on. Once we get to 700 vs 0, the gradient would start to turn yellow.
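To make the failure of a plain percentage difference concrete, here is one common definition of relative difference (taken against the larger of the two counts; the function name is mine):

```python
def pct_diff(a: int, b: int) -> float:
    """Relative difference against the larger counter (undefined for 0 vs 0)."""
    return abs(a - b) / max(a, b)

# The trivial case scores far worse than the serious one:
print(pct_diff(2, 1))              # 0.5  -> 50%, but only 2 views
print(pct_diff(100_000, 90_000))   # 0.1  -> 10%, yet 10 000 views are missing
```

A naive threshold on this number would flag the harmless 2-vs-1 case and wave through the 100 000-vs-90 000 case, which is exactly backwards.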

Now if one party counts 1000 and the other counts 0, that's a problem, so the gradient here would be red. But when the numbers get big, say 100 000 vs 90 000, that's not that big of an issue; the gradient might be yellow-ish. When the numbers get bigger still, 1 000 000 vs 900 000, that is a real issue again, and the gradient would be red.

There are no hard borders for what is significant, but I would assume that anything above 1000 on either of the two sides starts to make a difference (relative to smaller numbers). What would be an appropriate algorithm for choosing these borders / color switches on the gradient map?
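For concreteness, here is a sketch of one candidate I have considered, not a definitive answer: a Poisson-style z-score. If both counters observe the same underlying traffic, the difference a − b has variance roughly a + b, so |a − b| / sqrt(a + b) measures how statistically surprising the gap is, and it automatically discounts large percentage gaps at tiny volumes. The thresholds below are not principled constants; they are tuned by eye to the examples above.

```python
import math

def discrepancy_score(a: int, b: int) -> float:
    """Poisson-style z-score: |a - b| / sqrt(a + b).

    Assumes both parties count the same underlying events, so the
    difference has variance ~ (a + b). Small volumes yield small
    scores even when the percentage gap is large.
    """
    if a == 0 and b == 0:
        return 0.0
    return abs(a - b) / math.sqrt(a + b)

# Thresholds hand-tuned to the examples in the question:
# 500 vs 0 still green, 700 vs 0 yellow, 1000 vs 0 red,
# 100 000 vs 90 000 yellow, 1 000 000 vs 900 000 red.
GREEN_MAX = 22.5
YELLOW_MAX = 30.0

def gradient_color(a: int, b: int) -> str:
    z = discrepancy_score(a, b)
    if z < GREEN_MAX:
        return "green"
    if z < YELLOW_MAX:
        return "yellow"
    return "red"
```

For a smooth gradient rather than three bands, the score can be clamped to [0, 1] with something like `min(z / YELLOW_MAX, 1.0)` and fed into a green-to-red colormap. Whether a z-score is the right measure at all (as opposed to, say, a rule combining relative and absolute difference) is exactly the open part of the question.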