As we know, there are many different ways to compare the level of similarity between two strings of text (e.g. https://en.wikipedia.org/wiki/String_metric).
I have often found myself confused about which metrics are better suited to which problems: one metric might say that two strings are very similar, while a different metric might say that the same two strings are very different. This led me to the following idea: for a given problem, perhaps I could take the average of multiple metrics and obtain a more "balanced" metric. Is this a mathematically valid approach?
As an example, I simulated this dataset in R:
set.seed(123)
myFun <- function(n = 5000) {
  # 5 random letters + a zero-padded 4-digit number + 1 random letter
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
col1 = myFun(10)
col2 = myFun(10)
col3 = myFun(10)
col4 = myFun(10)
example = data.frame(col1, col2, col3, col4)
col1 col2 col3 col4
1 ONHGW1599P VTYUV6387G LPTDH0712U KYIQF7684P
2 SVZUU4237T RVNEL9039J NMYTN9644I ATIZR2378Q
3 NYGLG3937F QYCHT7281E SSPLU5370G YZEQQ5509U
4 CZJOU4089K BNNSN9175F YXXVM3501G SGWZE4213M
5 JEIJF2907H DYGJQ4715P GTVQB3069Z JYNUY9271U
6 RSSMY0294V MWCRN0151X VOKJK8720J UWNGT9433H
7 VYDGB8469V ECWJV6810U ZGPTM4055X VZFUC7826U
8 KYNIE0041G VHVLC9830W IDTKN9761V MTAZY1706G
9 EIQIH8508P SPZBH8174K GAHYF0473W KXZIN7879T
10 TCKJL7391Q YLOJN6911D BHCWY6098Z KYJTA9065R
Suppose I am interested in all pairwise comparisons between (col1, col2) and between (col3, col4). The stringdist package in R can evaluate the following distance metrics: method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
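To make the scale differences concrete, here is a quick single-pair comparison (the strings are taken from the simulated data above): "qgram" counts mismatched q-grams and is an unbounded integer, while "cosine" is bounded in [0, 1].

```r
library(stringdist)

# The same pair of strings under two metrics on very different scales:
stringdist("ONHGW1599P", "VTYUV6387G", method = "qgram")   # q-gram count, unbounded
stringdist("ONHGW1599P", "VTYUV6387G", method = "cosine")  # bounded in [0, 1]
```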
I wrote this loop that can calculate all these metrics for all possible combinations:
library(stringdist)
library(dplyr)  # for %>% and as_tibble
library(tidyr)  # for pivot_longer

method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex")
results = list()
for (i in 1:length(method))
{
  method_i = method[i]
  name_1_i = paste0("col1_col_2", method_i)
  name_2_i = paste0("col3_col_4", method_i)
  p1_i = stringdistmatrix(col1, col2, method = method_i, useNames = "string") %>%
    as_tibble(rownames = "a") %>%
    pivot_longer(-1, names_to = "b", values_to = name_1_i)
  p2_i = stringdistmatrix(col3, col4, method = method_i, useNames = "string") %>%
    as_tibble(rownames = "a") %>%
    pivot_longer(-1, names_to = "b", values_to = name_2_i)
  # keep only the distance columns and bind them side by side
  p1_i = p1_i[, 3]
  p2_i = p2_i[, 3]
  final_i = cbind(p1_i, p2_i)
  results[[i]] = final_i
}
final = do.call(cbind.data.frame, results)
final = cbind(col1,col2, col3,col4, final)
I can now proceed to take these averages as follows and append them to the original data:
average_col1_col2_dist = (final$col1_col_2osa + final$col1_col_2lv + final$col1_col_2dl +
  final$col1_col_2hamming + final$col1_col_2lcs + final$col1_col_2qgram +
  final$col1_col_2cosine + final$col1_col_2jaccard + final$col1_col_2jw +
  final$col1_col_2soundex) / 10
average_col3_col4_dist = (final$col3_col_4osa + final$col3_col_4lv + final$col3_col_4dl +
  final$col3_col_4hamming + final$col3_col_4lcs + final$col3_col_4qgram +
  final$col3_col_4cosine + final$col3_col_4jaccard + final$col3_col_4jw +
  final$col3_col_4soundex) / 10
final = data.frame(col1, col2, col3, col4, average_col1_col2_dist, average_col3_col4_dist)
col1 col2 col3 col4 average_col1_col2_dist average_col3_col4_dist
1 ONHGW1599P VTYUV6387G LPTDH0712U KYIQF7684P 7.985784 7.728889
2 SVZUU4237T RVNEL9039J NMYTN9644I ATIZR2378Q 6.692500 7.310131
3 NYGLG3937F QYCHT7281E SSPLU5370G YZEQQ5509U 7.523311 7.123930
4 CZJOU4089K BNNSN9175F YXXVM3501G SGWZE4213M 6.471213 7.338889
5 JEIJF2907H DYGJQ4715P GTVQB3069Z JYNUY9271U 6.276818 6.181671
6 RSSMY0294V MWCRN0151X VOKJK8720J UWNGT9433H 6.578095 7.313864
Now my question: is the approach I have outlined mathematically valid?
I know that you can take the average of any set of measurements, but the resulting average might not always be meaningful (e.g. when the measurements are in different units). Inspecting the results, I see that the qgram distances tend to be much larger than the cosine distances, which makes me think I should first normalize each column (e.g. final = scale(final)) - but again, I am not sure whether that is mathematically valid either.
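One way to sketch that normalization idea is to min-max rescale each distance column to [0, 1] before averaging, so that metrics on different scales contribute comparably. The toy data frame below stands in for the `final` data frame built by the loop above (same column-name pattern); note that a constant column (max == min) would produce NaN.

```r
# Min-max rescale each distance column to [0, 1] before averaging.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))  # NaN if max == min

# Toy stand-in for the `final` data frame from the loop above:
final <- data.frame(col1_col_2qgram  = c(0, 5, 10),
                    col1_col_2cosine = c(0.0, 0.4, 0.8))
dist_cols <- grep("^col1_col_2", names(final), value = TRUE)
final[dist_cols] <- lapply(final[dist_cols], rescale01)
average_col1_col2_dist <- rowMeans(final[dist_cols])
```

After rescaling, both columns run from 0 to 1, so neither metric dominates the average purely because of its scale.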
Does anyone know whether it is mathematically valid to average multiple string metrics?
Thank you!
Yes, it is mathematically valid to take the average of multiple metrics and obtain a more "balanced" metric.
The fundamental properties, or in fact the definition, of a metric (a.k.a. distance) $d$ on a set $S$ is that $d$ is a function from $S\times S$ to $\Bbb R_{\ge0}$ such that for all $x,y,z\in S$:

1. $d(x,y)=0$ if and only if $x=y$ (identity of indiscernibles),
2. $d(x,y)=d(y,x)$ (symmetry),
3. $d(x,z)\le d(x,y)+d(y,z)$ (triangle inequality).

In the case of strings, the definition says a distance $d$ on pairs of strings must satisfy these same three properties, with $S$ the set of all strings. If a function satisfies the three properties above, then we can call it a distance (a.k.a. metric) on strings mathematically.
Now the question is when you take some weighted sum of some metrics, will you get a metric still?
Let us formulate the question mathematically. For simplicity, suppose we use only two metrics.
Lemma: Suppose $d_1$, $d_2$ are two metrics on $S$ and $w_1, w_2$ are two positive numbers (weights). Let $d=w_1d_1+w_2d_2$. Then $d$ is also a metric on $S$.
Proof: All three required properties of $d$ to be a metric are straightforward to verify. I will verify the triangle inequality here.
For any $x,y,z\in S$,
$$\begin{aligned} d(x,y)+ d(y,z)&=(w_1d_1(x,y)+w_2d_2(x,y))+ (w_1d_1(y,z)+w_2d_2(y,z))\\ &=w_1(d_1(x,y)+d_1(y,z))+ w_2(d_2(x,y)+d_2(y,z))\\ &\ge w_1d_1(x,z) +w_2d_2(x,z)\\ &=d(x,z)\\ \end{aligned}$$ where the inequality above comes from the triangle inequalities of $d_1$ and $d_2$.
The case of a weighted sum of more than two metrics is similar; we obtain a metric as well.
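As a quick numerical sanity check of the lemma, the snippet below (using the stringdist package, two of the true metrics from the question, "lv" and "lcs", and strings taken from the simulated data above) verifies the triangle inequality for one weighted sum:

```r
library(stringdist)

# A positively weighted sum of two true metrics is again a metric;
# check the triangle inequality on a sample triple of strings.
w1 <- 0.3; w2 <- 0.7
d <- function(x, y) w1 * stringdist(x, y, method = "lv") +
  w2 * stringdist(x, y, method = "lcs")

x <- "ONHGW1599P"; y <- "VTYUV6387G"; z <- "LPTDH0712U"
d(x, y) + d(y, z) >= d(x, z)  # holds for any triple x, y, z
```

Of course a single triple does not prove the lemma; the proof above does, but it is reassuring to see it hold on the data.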
You might have suspected that there are more ways to construct "more balanced" metrics from given metrics besides weighted sums.
Yes, there are many; exploring them would take another post. Hopefully, and probably, some kind of weighted sum is good enough for your purposes.