Relative Difference between two non-negative scalars


I've been looking for a way to compare two values, and every idea I think of falls short in some way. Essentially, I want a function $f$ that meets these conditions for non-negative scalars $x, y$:

If $x = 1000$ and $y = 1100$, $f(x,y)$ is not significantly large.
If $x = 0$ and $y = 1$, $f(x,y)$ is not significantly large.
If $x = 100$ and $y = 300$, $f(x,y)$ is significantly large.
If $x = 0$ and $y = 100$, $f(x,y)$ is significantly large.

Possible options I've considered:

  1. $f(x,y) = x-y$
  2. $f(x,y) = \frac{x-y}{\min(x,y)}$
  3. $f(x,y) = \frac{x-y}{(x+y)/2}$
  4. $f(x,y) = \frac{x-y}{\max(\min(x,y),1)}$

Problems with each option:

  1. $f(6100,6000) = f(100,0)$, so the function ignores scale
  2. Division by zero
  3. $f(1,0) = f(100,0)$
  4. This is the best option I've considered, and it is often how K/D ratios are calculated in video games. Still, it feels odd, and I believe a better solution exists.
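These failure modes are easy to check directly. A quick Python sketch (the function names `f1`–`f4` are my own labels for options 1–4):

```python
def f1(x, y):
    # Option 1: plain difference
    return x - y

def f2(x, y):
    # Option 2: relative to the smaller value; undefined when min(x, y) == 0
    return (x - y) / min(x, y)

def f3(x, y):
    # Option 3: relative to the mean of the two values
    return (x - y) / ((x + y) / 2)

def f4(x, y):
    # Option 4: clamp the denominator at 1 to avoid division by zero
    return (x - y) / max(min(x, y), 1)

print(f1(6100, 6000), f1(100, 0))  # both 100: option 1 ignores scale
print(f3(1, 0), f3(100, 0))        # both 2.0: option 3 ignores magnitude
```

Option 2 additionally raises `ZeroDivisionError` whenever the smaller value is zero, which is exactly the problem noted above.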

Some other properties of my mythical function $f$:

$f(x,y) = -f(y,x)$
$f(x,x) = 0$

I realize I have not been precise with the values I want $f$ to return, but $f(x,y)$ should essentially be the answer to the question "how big is the difference between $x$ and $y$?"

To give an example (and my current use case): I want to compare the usage of words between two texts of similar size. Obviously, "and" will be used a lot by both texts, and one may use it hundreds of times more than the other, but it is overall insignificant. Similarly, one text may use one rare word once while the other doesn't, but this too is insignificant. However, if one text uses a certain word far more often than the other text, this is statistically significant, just like if one text uses a rare word multiple times.


This may seem slightly complicated, though it seems to give the sort of function you are looking for.

Taking a Bayesian approach, suppose as a prior that the proportion of occurrences of the first type among occurrences of both types is uniformly distributed on $[0,1]$, i.e. has a $\text{Beta}(1,1)$ distribution.

With your observations, your posterior distribution for that proportion would be $\text{Beta}(x+1,y+1)$, and your posterior probability that the first type is less common than the second would be $q=\int_0^{1/2} \frac{p^{x}(1-p)^y}{\text{B}(x+1,y+1)}\,dp$.

That gives a probability estimate $q$ in $(0,1)$. Taking the log-odds (logit) gives $\log\left(\frac{q}{1-q}\right)$ in $(-\infty,\infty)$, which may meet your needs as a function.

For example in R, you could construct this function

reldist <- function(m, n) {
  # log-odds that the first type is less common: log(q) - log(1 - q)
  pbeta(0.5, m + 1, n + 1, lower.tail = TRUE,  log.p = TRUE) -
  pbeta(0.5, m + 1, n + 1, lower.tail = FALSE, log.p = TRUE)
}

which for example would give

> reldist(1,2)
[1] 0.7884574


x,y         reldist
---------   ---------- 
1,2         0.7884574
1000,1100   4.215137
100,300     55.41071
0,100       70.00787
100,100     0
2,1         -0.7884574
1100,1000   -4.215137
etc.
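If R isn't to hand, the same function can be sketched in Python using only the standard library. For integer shape parameters the Beta CDF satisfies $I_{1/2}(a,b) = P\left(\mathrm{Bin}(a+b-1,\tfrac12)\ge a\right)$, so both tail probabilities can be summed in log space; this `reldist` is my own restatement of the R code above, not part of the original answer:

```python
import math

def _logsumexp(logs):
    # Numerically stable log(sum(exp(v) for v in logs))
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

def reldist(x, y):
    # With a = x+1, b = y+1, I_{1/2}(a, b) = P(Bin(a+b-1, 1/2) >= a)
    n = x + y + 1
    a = x + 1
    # log of each binomial pmf term C(n, j) * (1/2)^n
    terms = [math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
             + n * math.log(0.5) for j in range(n + 1)]
    log_q = _logsumexp(terms[a:])    # log P(first type less common) = log q
    log_1mq = _logsumexp(terms[:a])  # log(1 - q)
    return log_q - log_1mq           # logit(q)
```

Working in log space keeps extreme cases such as $(0,100)$ finite, where $q$ is within $2^{-101}$ of $1$ and a direct computation of $\log\frac{q}{1-q}$ would overflow.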

You can draw the significance threshold wherever you wish. An arbitrary two-tailed probability of $5\%$ (so $2.5\%$ in each tail) corresponds to log-odds critical points of about $\pm\log\left(\frac{0.975}{0.025}\right) \approx \pm 3.66$, so $1000,1100$ might be seen as just significant on this basis. By contrast, an arbitrary two-tailed probability of $0.1\%$ corresponds to log-odds critical points of about $\pm 7.6$.
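Those cutoffs are just the logit evaluated at the chosen tail probabilities, e.g.:

```python
import math

# Log-odds cutoffs for the two arbitrary significance levels mentioned above
cut_5pct = math.log(0.975 / 0.025)    # two-tailed 5%  -> about 3.66
cut_01pct = math.log(0.9995 / 0.0005)  # two-tailed 0.1% -> about 7.60
print(cut_5pct, cut_01pct)
```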