Proving triangle inequality holds for Jaccard Distance


The following explanation is given in the Mining of Massive Datasets textbook to prove that the triangle inequality holds for the Jaccard distance:

"For the triangle inequality, recall from Section 3.3.3 that $\text{SIM}(x, y)$ is the probability a random minhash function maps $x$ and $y$ to the same value. Thus, the Jaccard distance $d(x, y)$ is the probability that a random minhash function does not send $x$ and $y$ to the same value. We can therefore translate the condition $d(x, y) \leq d(x, z) + d(z, y)$ to the statement that, if $h$ is a random minhash function, then the probability that $h(x) \neq h(y)$ is no greater than the sum of the probability that $h(x) \neq h(z)$ and the probability that $h(z) \neq h(y)$. However, this statement is true because whenever $h(x) \neq h(y)$, at least one of $h(x)$ and $h(y)$ must be different from $h(z)$. They could not both be $h(z)$, because then $h(x)$ and $h(y)$ would be the same."

reference link : http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
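Before getting into the proof, it may help to check the inequality numerically on a few concrete sets. The sets below are arbitrary examples, not from the textbook:

```python
# Jaccard distance between two sets: 1 - |intersection| / |union|.
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

# Arbitrary example sets.
x = {1, 2, 3, 4}
y = {3, 4, 5, 6}
z = {2, 3, 4, 5}

d_xy = jaccard_distance(x, y)  # 1 - 2/6 = 2/3
d_xz = jaccard_distance(x, z)  # 1 - 3/5 = 0.4
d_zy = jaccard_distance(z, y)  # 1 - 3/5 = 0.4

# Triangle inequality: d(x, y) <= d(x, z) + d(z, y).
assert d_xy <= d_xz + d_zy
```

Of course, checking examples is not a proof; the proof below shows why the inequality can never fail.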

Could someone elaborate further on this, especially the last two sentences? That is where the gist of the proof is, but if someone could present it in a different, more step-by-step way, that would be great.

Thank you.

Best answer:

Let $h$ be a random minhash function (note: I'm not entirely sure what a minhash function is, but I don't believe this actually matters for the explanation).

We have that: $$d(x,y) = \mathbb{P}[h(x) \neq h(y)]$$ We want to show that: $$\mathbb{P}[h(x) \neq h(y)] \leq \mathbb{P}[h(x) \neq h(z)] + \mathbb{P}[h(z) \neq h(y)]$$

When they say "whenever $h(x) \neq h(y)$", this is the same as conditioning each probability on the event $h(x) \neq h(y)$, so we get: $$\mathbb{P}[h(x) \neq h(y)\mid h(x) \neq h(y)] \leq \overbrace{\mathbb{P}[h(x) \neq h(z)\mid h(x) \neq h(y)]}^{A} +\overbrace{\mathbb{P}[h(z) \neq h(y)\mid h(x) \neq h(y)]}^{B}$$

The probability on the far left is clearly $1$, so it suffices to show that at least one of the probabilities on the right is $1$. To see this, first imagine $h(x) = h(z)$. Then the probability marked $A$ is $0$; but since $h(x) \neq h(y)$ and $h(x) = h(z)$, we must have $h(z) \neq h(y)$, so the event marked $B$ always occurs (contributing $1$), and the inequality is satisfied.

Now, imagine $h(z) = h(y)$. Then, reversing the above argument, $B$ will contribute $0$ and $A$ will contribute $1$, and the inequality will be satisfied.

If neither of these are true, then we have that $h(z) \neq h(y)$ and $h(x) \neq h(z)$, so $A$ and $B$ will both contribute $1$, and the inequality is satisfied.
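The three cases amount to a pointwise fact about indicators: for any values $a = h(x)$, $c = h(z)$, $b = h(y)$, the indicator of $a \neq b$ is at most the indicator of $a \neq c$ plus the indicator of $c \neq b$; taking expectations over the random choice of $h$ then gives exactly the inequality we want. A small exhaustive check of the pointwise fact (three placeholder values suffice, since they realize every equality/inequality pattern):

```python
from itertools import product

# Enumerate all patterns of (h(x), h(z), h(y)) over a 3-value alphabet.
for a, c, b in product(range(3), repeat=3):
    lhs = int(a != b)                # indicator of h(x) != h(y)
    rhs = int(a != c) + int(c != b)  # indicators of h(x) != h(z), h(z) != h(y)
    assert lhs <= rhs
```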

So essentially, they just check all the possibilities. One possibility they don't check is when the probability on the far left is $0$ (that is, conditioning every event on $h(x) = h(y)$). But we don't need to check this case: all probabilities are greater than or equal to $0$, so whatever appears on the right will satisfy the inequality.
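For context on the minhash function itself: one common construction (described in MMDS Chapter 3) picks a random permutation of the universe of elements and maps each set to its minimum element under that permutation; then $\mathbb{P}[h(x) = h(y)] = \text{SIM}(x, y)$, the Jaccard similarity. A rough Monte Carlo sketch of this (the seed, trial count, and example sets are arbitrary choices):

```python
import random

def minhash_disagree_prob(x, y, universe, trials=20000, seed=0):
    """Estimate P[h(x) != h(y)] over random permutations (minhash)."""
    rng = random.Random(seed)
    universe = list(universe)
    disagree = 0
    for _ in range(trials):
        rng.shuffle(universe)
        # Rank of each element under this random permutation.
        rank = {v: i for i, v in enumerate(universe)}
        hx = min(x, key=rank.__getitem__)
        hy = min(y, key=rank.__getitem__)
        disagree += hx != hy
    return disagree / trials

x, y = {1, 2, 3, 4}, {3, 4, 5, 6}
est = minhash_disagree_prob(x, y, universe=range(1, 7))
# The exact Jaccard distance here is 1 - 2/6 = 2/3; the estimate should be close.
```

This makes the probabilistic reading of $d(x, y)$ concrete, even though, as shown above, the proof itself needs nothing beyond the pointwise case analysis.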