Properly comparing two histograms

7.6k Views Asked by At

I need to implement a function (in Golang) to compare the similarity/distance of two histograms. The histograms were generated from two different images. I have searched on the internet and have found some metrics that can be used to perform this comparison, for example, Chi Square and Intersection.

Here are all the metrics and formulas that I have found:

Chi Square :

Formula:

$x^2 = \sum_{i=1}^{n}\frac{(hist1_{i} - hist2_{i})^2}{hist1_{i}}$

Corresponding code:

var sum float64
for index := 0; index < len(hist1); index++ {
    numerator := pow(hist1[index] - hist2[index])
    sum += numerator / hist1[index]
}

Where pow is a function that returns value * value.

Euclidean Distance :

Formula:

$D = \sqrt{\sum_{i=1}^{n}(hist1_{i} - hist2_{i})^2}$

Corresponding code:

var sum float64
for index := 0; index < len(hist1); index++ {
    sum += pow(hist1[index] - hist2[index])
}
sum = math.Sqrt(sum)

Normalized Euclidean Distance :

Formula:

$D = \sqrt{\sum_{i=1}^{n} \frac{(hist1_{i} - hist2_{i})^2}{n}}$

Corresponding code:

var sum float64
n := float64(len(hist1))
for index := 0; index < len(hist1); index++ {
    sum += pow(hist1[index]-hist2[index]) / n
}
sum = math.Sqrt(sum)

Intersection :

Formula:

$D = \sum_{i=1}^{n} min(hist1_{i}, hist2_{i})$

Corresponding code:

var sum float64
for index := 0; index < len(hist1); index++ {
    sum += min(hist1[index], hist2[index])
}

Where the min function returns the minimum value.

Normalized Intersection :

Formula:

$D = \frac{\sum_{i=1}^{n} min(hist1_{i}, hist2_{i})}{max(\sum_{i=1}^{n}hist1_{i},\sum_{i=1}^{n}hist2_{i})}$

Corresponding code:

var sum float64
for index := 0; index < len(hist1); index++ {
    sum += min(hist1[index], hist2[index])
}
sum = sum / max(sum(hist1), sum(hist2))

Where the sum function returns the sum of all elements in the slice and the max function returns the maximum value.

References :

  1. OpenCV Histogram Comparison: http://docs.opencv.org/2.4/doc/tutorials/imgproc/histograms/histogram_comparison/histogram_comparison.html
  2. Euclidean Distance: http://www.pbarrett.net/techpapers/euclid.pdf
  3. Histogram intersection for change detection: http://blog.datadive.net/histogram-intersection-for-change-detection/
  4. The Simplest Classifier: Histogram Comparison: https://mpatacchiola.github.io/blog/2016/11/12/the-simplest-classifier-histogram-intersection.html
  5. Histogram Intersection with two different bin sizes?: https://dsp.stackexchange.com/questions/18065/histogram-intersection-with-two-different-bin-sizes

Questions :

Now, I have some questions based on the above description:

  1. I'm not a mathematician (I'm a software developer), so can someone confirm to me if all 5 formulas are correct?
  2. Which of the 5 metrics is the best (or commonly used) to compare two histograms? (As I have read people uses commonly the Chi-Square and the Intersection, but I'm not sure).
  3. Is there some other good metric (or commonly used) to compare two histograms?
  4. As I understand: when using the Chi-Square, Euclidean Distance or Normalized Euclidean Distance, the closer to zero is the result, higher is the similarity between the histograms. In the other hand, when using the Intersection or the Normalized Intersection, higher results means higher similarity between the histograms. If it is correct, is there a way that I can invert the results from the Intersection and Normalized Intersection so it 'works' as the other metrics (closer to zero means higher similarity)? As the Normalized Intersection results range is between 0-1 it is easy to invert, but I don't know if it is possible to the Intersection metric.

Notes :

  • If someone find something wrong in the codes above, please let me know.
  • I'm sorry if I have written some bullshit on this question, as I explained I am not a mathematician so I don't have much knowledge about this.
  • Any suggestion to improve the question, the formulas or codes are welcome.
  • I'm not sure if I have used the correct tags, so please feel free to edit it.
1

There are 1 best solutions below

11
On BEST ANSWER

From what I understand you're comparing two lists of numbers.

The Euclidian distance seems like a good method (which is equivalent to the normalized Euclidian distance, you just scale it down to $[0,1]$).

The Chi square method is quite similar to the Euclidian distance but it punishes relative difference instead of absolute :

For example, let's have : $h_1=[1,10]$ and $h_2=[2,11]$.

In this example the Euclian method will treat the difference between ($1$ and $2$) and ($10$ and $11$) the same, whereas the chi method will consider the difference between $1$ and $2$ to be more significant than the one between $10$ and $11$.

That's up to you to decide which one you prefer.

You can discard the intersections as they really have no value, as is, in your case.

Another simple metric you could use is this one :

$$d = \sum_n|h1_i-h2_i|$$

This one, and the euclidian metric, are classic metrics in mathematic. The main difference between those two is the euclidian method will punish a lot more a large difference.

As for the code, are you sure you can't fix those square by any methods ? Also, for the normalized euclidian, you might want to assign a variable to len(h1) outside the loop instead of searching for it each iteration.

EDIT : As I read deeper into your references, I think I might have overlooked the intersection method. Can you explain a little bit more what you're trying to achieve by comparing the two histograms ?