How to convert table into a distance function?

74 Views Asked by At

Been stumped on this past paper question for a while, it's in the context of clustering and the next part is using single linkage bottom-up hierarchical clustering to form a dendrogram using your distance function.

My question is what exactly is a distance function in this context? I know about euclidean and Manhattan but how would you use it for this?

table

1

There are 1 best solutions below

0
On

There is not going to be a formula for this distance (like for Manhattan, Euclidean, etc). The values must be based on the given table. At a minimum, we'd like to have:

  • a distance from product to itself is zero.
  • a distance between two products is smaller if they are bought together more often.

Let $n(p,q)$ be the number of times that $p$ and $q$ are bought together. My first attempt would be: $$d(p,q)=\begin{cases} 0\quad &\text{ if }p=q \\ 1/n(p,q)\quad &\text{ if } p\ne q\end{cases}$$

You can then do some computations to check whether those desirable properties hold (symmetry is ok, the triangle inequality is unlikely to hold).

More sophisticated distances are:

$$d_1(p,q)=\begin{cases} 0\quad &\text{ if }p=q \\ \sum_{r}n(p,r)/n(p,q)\quad &\text{ if } p\ne q\end{cases}$$ which can be interpreted as the reciprocal of conditional probability of buying $q$, having bought $p$. Unfortunately, this is not symmetric.

$$d_2(p,q)=\begin{cases} 0\quad &\text{ if }p=q \\ \sum_{r}(n(p,r)+n(q,r))/n(p,q)\quad &\text{ if } p\ne q\end{cases}$$ which is a symmetric version of $d_1$, but not as transparent.

To get the triangle inequality from anything like this would be unusual. But in applications this is not necessarily a problem.