What is the best way to measure similarity between two histograms


What is the best way to measure the similarity between two histograms? For example, in the following pictures, how can I tell if the distributions are similar enough?

enter image description here

enter image description here

I now have 2 lists of values, and I've normalized them to fall between 0 and 1.

I've tried several statistical measures, including Pearson, Spearman, and Kolmogorov-Smirnov, and Spearman looks like the best one to use. However, Spearman is not consistent: it sometimes gives me a high "s" value even though the shapes of the distributions are not similar enough. In theory, a higher (positive) "s" means the values are strongly correlated. Am I even on the right track using correlation to measure similarity? Are there any other tests that can be used for this?

corr0.89_1 = [10.7441, 8.9568, 11.0018, 9.29803, 8.92043, 8.78492, 13.5503, 6.74334, 6.14392, 5.75271, 28.851, 26.8173, 6.52642, 6.56071, 5.7169, 7.3095, 6.36379, 5.74984, 7.10243, 5.87364, 11.2827, 2.94984, 2.84836, 22.8551, 24.8372, 10.6571, 9.7891, 11.3021, 5.89328, 10.1372, 24.0525, 3.49401, 2.16394, 11.2825, 11.6859, 7.9918, 13.2742, 11.1194, 2.49575, 16.733, 27.918, 3.27145, 14.3346, 20.4979, 13.0808, 13.6282, 14.1474, 25.0414, 8.06032, 280.803, 22.0135, 18.2725, 12.9601, 7.64593]
corr0.89_2 = [9.14167, 6.30561, 7.7479, 8.05475, 7.14188, 7.62774, 9.18454, 1.48037, 1.5912, 2.07612, 21.302, 22.7082, 2.67858, 2.25732, 1.74804, 2.04191, 2.03539, 1.78882, 2.57568, 1.6512, 8.62473, 2.99236, 3.13484, 13.014, 16.2016, 9.17172, 7.97379, 9.12539, 4.8298, 8.42477, 16.0582, 2.68252, 1.92429, 5.6744, 4.70516, 5.20169, 11.0945, 9.10398, 2.68375, 13.6299, 17.3429, 3.19181, 9.41762, 12.2805, 9.92005, 11.5985, 11.7269, 17.4832, 6.66996, 60.8647, 13.9616, 14.9909, 10.4712, 6.13891]
corr1.0_1 = [0.00905783, 0, 0.0075662, 0, 0.00583336, 0, 0.0101741, 0.00617847, 0.00474902, 0, 0.0243326, 0.0300779, 0.0062144, 0.00581433, 0, 0.00712057, 0.00703617, 0, 0, 0, 0.0101258, 0.00844863, 0.014988, 0.0248553, 0, 0.00680134, 0.00762619, 0.00701553, 0.0106525, 0.00425654, 0.0160354, 0, 0, 0, 0, 0, 0.0110151, 0.00874536, 0, 0.0182528, 0.0291939, 0, 0.0426431, 0.0141304, 0.0139076, 0.0182638, 0.0177141, 0.021119, 0, 12.3977, 0.0121492, 0.016053, 0.0148212, 0.00767271]
corr1.0_2 = [0.128504, 0.119172, 0.0403692, 0.148327, 0.132162, 0.0366454, 0.139191, 0.0464803, 0.0235099, 0.0333772, 0.0427275, 0.0510047, 0.0278845, 0.0202271, 0.0918039, 0.129276, 0.0266636, 0.0399166, 0.693549, 0.131911, 0.134276, 0, 0, 0.248764, 0.17239, 0.0450586, 0.0932654, 0.0671463, 0.239433, 0.102551, 0.378029, 0.031807, 0.0181028, 0.107356, 0.145449, 0.0735069, 0.788291, 0.496569, 0.0209139, 0.0983066, 0.0530917, 0.0755444, 0.25198, 0.550969, 0.172254, 0.104131, 0.113987, 0.548016, 0.302768, 126.145, 0.886364, 0.107977, 0.4037, 1.23249]

There are 1 best solutions below

On BEST ANSWER

First of all, you need to define unambiguously what you mean by similarity for the problem you are solving; otherwise each of these coefficients will correctly measure similarity according to its own definition. Suppose you define similarity as how closely the shape of one distribution matches the other. In that case, if you resize both histograms to the same scale and place one on top of the other, perfectly similar histograms will overlap completely; otherwise, one or both will have non-overlapping parts.
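A minimal sketch of this overlap idea, assuming both histograms use the same bins and that "the same scale" means normalizing each one so its heights sum to 1 (other normalizations are possible). The function names and the sample heights are hypothetical, chosen only for illustration:

```python
def normalize(heights):
    """Scale bin heights so they sum to 1, giving both histograms equal area."""
    total = sum(heights)
    if total == 0:
        raise ValueError("histogram has no mass")
    return [h / total for h in heights]


def overlap(hist_a, hist_b):
    """Histogram intersection: the area shared by two histograms on the same bins.

    Close to 1.0 when the shapes coincide, 0.0 when no bin is shared.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must use the same bins")
    a, b = normalize(hist_a), normalize(hist_b)
    return sum(min(x, y) for x, y in zip(a, b))


# Hypothetical bin heights, for illustration only
h1 = [1, 2, 3, 2, 1]
h2 = [2, 4, 6, 4, 2]  # same shape as h1, twice the counts
h3 = [5, 1, 1, 1, 1]  # different shape

print(overlap(h1, h2))  # shapes coincide after normalization
print(overlap(h1, h3))  # only part of the area is shared
```

Where to put the "similar enough" threshold on this score depends on the problem; the sketch only quantifies how much of the two shapes coincides.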

For this definition, Spearman's coefficient will not work, since it only measures the correlation between the rank orders of the class intervals. Two distributions can have identical rank orders of intervals while the heights of those intervals differ vastly. In theory, there are infinitely many distributions whose Spearman coefficient with each other is 1 but whose shapes are all different. On the other hand, if the Pearson coefficient is 1, the interval heights of one distribution are an exact positive linear function of the other's, so after rescaling their shapes will be exactly the same and there will be no non-overlapping parts.
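The difference can be seen on a small made-up example. The sketch below implements both coefficients in plain Python (in practice a library such as SciPy would normally be used); the three lists of interval heights are hypothetical and are not taken from the question's data:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical interval heights, for illustration only
heights_a = [1, 2, 3, 4, 5]
heights_b = [1, 2, 3, 4, 500]  # same rank order as heights_a, very different shape
heights_c = [3, 5, 7, 9, 11]   # 2 * heights_a + 1, a linear rescaling

print(spearman(heights_a, heights_b))
print(pearson(heights_a, heights_b))
print(pearson(heights_a, heights_c))
```

Here Spearman reports a perfect score for `heights_a` and `heights_b` (up to floating-point rounding) because only the ordering is compared, while Pearson drops to roughly 0.71 for the same pair and stays at 1 for the linearly rescaled `heights_c`.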

For more rigorous measures, refer to this link: Similarity measure between multiple distributions