Is it safe to say 'two distributions are 70% similar' if their total variation distance is 0.3?


I'm writing an engineering paper and looking for a metric to represent the similarity of two discrete probability distributions. As far as I know, there are many such metrics or distances, but the total variation distance seems to be the simplest.

$\mathrm{Dist} = \frac{1}{2} \sum_{x} |f(x) - g(x)|$

As the target venue has nothing to do with mathematics, I want to keep it as straightforward as possible. For example, saying that the distance is 0.3 is probably neither familiar nor straightforward for the readers.

But I noticed that the total variation distance is always between 0 and 1. So I was tempted to say that the similarity is 70% (= 1 - 0.3), rather than that the distance is 0.3. This amounts to defining similarity as 1 - distance, but I wonder whether this approach is safe. I hope it is common practice, but I suspect it isn't.

Best answer

Since the total variation distance is a statistical distance, it's probably more important to explain to the layperson what is being compared (two random variables, two probability distributions or samples, or an individual sample point and a population or a wider sample of points) than to simplify it as "the similarity is 70%". The mean and symmetry are also important, but that's a different point (one that is hidden by using TVD).

$D_{TV} = \frac{1}{2} \sum_{x} |f(x) - g(x)|$

The total variation distance is the largest possible difference between the probabilities that the two distributions can assign to the same event. With a distance of 0.3, the accurate statement is therefore "the similarity can be greater than 70%, while the difference can be as great as 30%".
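To make the "largest difference over events" reading concrete, here is a small Python sketch. The distributions `f` and `g` are made-up examples (they are not from the question), and the helper names `total_variation` and `max_event_gap` are hypothetical. It computes the distance from the formula and then brute-forces every event on a three-point support to check that the largest gap in probability matches.

```python
from itertools import combinations

def total_variation(f, g):
    """Half the L1 distance between two discrete distributions given as dicts."""
    support = set(f) | set(g)
    return 0.5 * sum(abs(f.get(x, 0.0) - g.get(x, 0.0)) for x in support)

def max_event_gap(f, g):
    """Brute force: largest |P_f(A) - P_g(A)| over all events A (subsets of the support)."""
    support = sorted(set(f) | set(g))
    best = 0.0
    for size in range(len(support) + 1):
        for event in combinations(support, size):
            p_f = sum(f.get(x, 0.0) for x in event)
            p_g = sum(g.get(x, 0.0) for x in event)
            best = max(best, abs(p_f - p_g))
    return best

# Made-up example distributions, chosen so the distance is about 0.3.
f = {"a": 0.5, "b": 0.3, "c": 0.2}
g = {"a": 0.2, "b": 0.3, "c": 0.5}

d_tv = total_variation(f, g)
gap = max_event_gap(f, g)
print(f"D_TV = {d_tv:.3f}, largest event gap = {gap:.3f}")
```

The brute force is exponential in the size of the support, so it is only meant as a check on tiny examples, not as a way to compute the distance.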

When trying to express a distance correlation to a layperson, simply saying "it's 70% similar" is 'selling' the similarity when there may not be as much similarity as was hoped for. See this example of the correlation of x and y for various distributions, from Wikipedia's article on distance correlation:

[Image: Distance Correlation Example]

Statistics is a complex subject, and the presentation can skew one's beliefs. As you can see, the "similarity" needs to be fairly high before one is satisfied that two things are similar; when the number is that low, I would prefer distance over similarity.

See also:

The image in this question: are the distributions quite similar or quite different? I wouldn't want to hear about the "similarity": https://stats.stackexchange.com/questions/6907/an-adaptation-of-the-kullback-leibler-distance

Pinsker's inequality shows that a naive inversion cannot always hold.
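For reference, Pinsker's inequality bounds the total variation distance by the Kullback-Leibler divergence: $D_{TV}(f, g) \le \sqrt{\frac{1}{2} D_{KL}(f \| g)}$, with the divergence measured in nats. The following is only a numerical sketch of that bound on made-up distributions (the lists `f` and `g` and the helper names are hypothetical), assuming `g` is positive wherever `f` is.

```python
import math

def total_variation(f, g):
    """Half the L1 distance between two distributions on the same finite support."""
    return 0.5 * sum(abs(p - q) for p, q in zip(f, g))

def kl_divergence(f, g):
    """KL divergence D_KL(f || g) in nats; terms with f(x) = 0 contribute 0."""
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)

# Made-up example distributions on a three-point support.
f = [0.5, 0.3, 0.2]
g = [0.2, 0.3, 0.5]

d_tv = total_variation(f, g)
d_kl = kl_divergence(f, g)
bound = math.sqrt(0.5 * d_kl)  # right-hand side of Pinsker's inequality

print(f"D_TV = {d_tv:.4f}, sqrt(D_KL / 2) = {bound:.4f}")
```

The bound only goes one way: a small KL divergence forces a small total variation distance, but the two numbers are on different scales, so a percentage derived from one does not carry over to the other.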

By your formula ($1-D_{TV}$), if your value were 0.05 you would describe the similarity as 95%, which ignores the means unless they are close together and doesn't properly emphasize the distance between a minimal and a maximal instance.
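As a rough illustration of the point about the means, here is a sketch with made-up numbers (the values, probabilities, and variable names are all hypothetical). Two distributions on the values 0, 1 and 1000 differ only in 5% of their mass, so the total variation distance is about 0.05, yet their means are far apart because the distance ignores where the values sit.

```python
def total_variation(f, g):
    """Half the L1 distance between two distributions on the same finite support."""
    return 0.5 * sum(abs(p - q) for p, q in zip(f, g))

def mean(values, probs):
    """Expected value of a discrete distribution."""
    return sum(v * p for v, p in zip(values, probs))

# Made-up example: g moves 5% of the probability mass from 1 to a far-away value.
values = [0, 1, 1000]
f = [0.50, 0.50, 0.00]
g = [0.50, 0.45, 0.05]

d_tv = total_variation(f, g)
similarity = 1 - d_tv
mean_f = mean(values, f)
mean_g = mean(values, g)

print(f"D_TV = {d_tv:.2f}, '1 - D_TV' = {similarity:.0%}")
print(f"mean under f = {mean_f:.2f}, mean under g = {mean_g:.2f}")
```

Here "95% similar" would be reported for two distributions whose means differ by roughly a factor of 100.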