Geometric mean on negative numbers - work-around


The geometric mean defined as

$$\text{Geometric mean}=\sqrt[n]{x_1\cdot x_2\cdots x_n}$$

is only defined when the product under the root is positive — in practice, for datasets of positive numbers. A description and clarifications are in this question. Having negative numbers in a dataset can thus make it very difficult to use. I still prefer the geometric mean over the arithmetic mean because of its better resilience against far outliers (at least when the dataset is not large enough for a trustworthy median to be used instead).
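As a quick numerical illustration of both points (a Python sketch with a made-up dataset), the geometric mean is pulled far less by a single outlier than the arithmetic mean, but fails outright on a negative value:

```python
import math

def geometric_mean(xs):
    # nth root of the product, computed via logs for numerical stability
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

data = [10, 12, 11, 1000]            # one far outlier
am = sum(data) / len(data)           # 258.25 -- dragged up by the outlier
gm = geometric_mean(data)            # about 33.9 -- far more resilient
print(am, gm)

try:
    geometric_mean([10, -12, 11])    # math.log rejects the negative value
except ValueError:
    print("undefined for negative inputs")
```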

A work-around to get rid of the no-negative-numbers issue could be to add a large enough number to every value before performing the geometric-mean operation and afterwards subtract that same number from the result:

$$\text{Geometric mean}_\text{work-around}=\sqrt[n]{(10000+x_1)\cdot (10000+x_2)\cdots (10000+x_n)}-10000$$

This moves all data values "out of" the negative zone, performs the operation and then moves the result "back down" again.
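A minimal sketch of this shift-compute-unshift idea (the function name and sample data are my own), which also shows the small discrepancy on all-positive data that the question describes below:

```python
import math

def shifted_geometric_mean(xs, k=10000.0):
    # shift every value by k, take the geometric mean, shift back
    if any(x + k <= 0 for x in xs):
        raise ValueError("k must be larger than -min(xs)")
    return math.exp(sum(math.log(x + k) for x in xs) / len(xs)) - k

# now works on data containing negatives
mixed = shifted_geometric_mean([5.0, -3.0, 2.0])

# but on all-positive data it no longer matches the true geometric mean:
pos = [4.0, 9.0, 16.0]
true_gm = math.exp(sum(math.log(x) for x in pos) / len(pos))  # (4*9*16)^(1/3), about 8.32
shifted = shifted_geometric_mean(pos)                         # about 9.66, near the arithmetic mean
print(mixed, true_gm, shifted)
```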

I do see minor differences, though, when I compare the geometric mean with and without the work-around on only-positive numbers.

Since I cannot clearly figure out how big the influence is, I am asking here to have it clarified. Is this work-around useful and correct to use, and can I trust the resulting mean datapoints?

More specifically, I do not clearly see why the true geometric mean and my work-around geometric mean differ, so I am asking how different they are and how/why they are different.

BEST ANSWER

If we let your large number be $k$, we could also write your definition as

$$ k\left(\prod_{i=1}^n \left(1 + \frac{x_i}{k}\right)\right)^{1/n} - k = k\exp\left(\frac{1}{n}\sum_{i=1}^n \log\left(1 + \frac{x_i}{k}\right)\right) - k. $$

Now replace the log with its Taylor series:

$$ k\exp\left(\frac{1}{n}\sum_{i=1}^n \log\left(1 + \frac{x_i}{k}\right)\right) - k = k\exp\left(\frac{1}{n}\sum_{i=1}^n \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m} \left(\frac{x_i}{k}\right)^{m}\right) - k $$
$$ = k\exp\left( \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m k^m} \cdot \frac{1}{n}\sum_{i=1}^n x_i^m \right) - k. $$

If the $\frac{x_i}{k}$ are small, which given the description of $k$ as 'large' may be reasonable, we can truncate the series at first order, giving

$$ k\exp\left( \frac{1}{nk}\sum_{i=1}^n x_i \right) - k = k\exp\left( \frac{1}{n}\sum_{i=1}^n x_i \right)^{1/k} - k. $$

So in the limit where $k$ is much larger than your data values, your modified geometric mean is really a function of the exponential of the arithmetic mean... I can't really see this being a good estimator, sadly. Or at least, it doesn't make intuitive sense to me; perhaps someone else has more insight.
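Numerically (a sketch with made-up data), the first-order truncation $k\,e^{\bar{x}/k} - k$ tracks the shifted geometric mean closely once $k$ dwarfs the data, with the gap shrinking roughly like $1/k$:

```python
import math

xs = [3.0, -1.5, 4.0, 2.5]
n = len(xs)
am = sum(xs) / n                             # arithmetic mean, here 2.0

def shifted_gm(xs, k):
    # shift by k, take geometric mean, shift back
    return math.exp(sum(math.log(x + k) for x in xs) / n) - k

for k in (10.0, 100.0, 10000.0):
    exact = shifted_gm(xs, k)
    first_order = k * math.exp(am / k) - k   # truncated-series approximation
    print(k, exact, first_order)             # the two agree ever more closely
```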

EDIT

In fact, it is also true that $$ \lim_{k\to\infty} \left( k \exp\left(\frac{x}{k}\right) - k \right) = x, $$ so rather interestingly we find that for very large $k$, your modified geometric mean is just the arithmetic mean: $$ \lim_{k\to\infty} \; \left[ k\left(\prod_{i=1}^n \left(1 + \frac{x_i}{k}\right)\right)^{1/n} - k \right] = \frac{1}{n}\sum_{i=1}^n x_i. $$ Additionally, the smallest admissible value of $k$, namely $k = -\min_i x_i$, makes one factor of the product zero, so the result there is $0 - k = \min_i x_i$. I suspect (seen in numerical testing, but not sure how to prove it) that the result increases monotonically from that minimum value up to the arithmetic mean as you increase $k$.
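This behaviour is easy to check numerically (a sketch; the dataset is made up): as $k$ grows from near its smallest admissible value, the result climbs towards the arithmetic mean, always staying below it:

```python
import math

xs = [6.0, -2.0, 3.0]
am = sum(xs) / len(xs)           # arithmetic mean, 7/3

def shifted_gm(xs, k):
    # shift by k, take geometric mean, shift back
    return math.exp(sum(math.log(x + k) for x in xs) / len(xs)) - k

# smallest admissible k here is 2; sweep k upward from just above it
vals = [shifted_gm(xs, k) for k in (2.5, 10.0, 100.0, 1e6)]
print(vals, am)                  # increasing, approaching am from below
```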

If true, this isn't really good news for your estimator... it would mean you can never get a value larger than the arithmetic mean (obviously a big problem if an outlier is below the mean) and to me at least, this suggests it has no value.

Although in fairness this is true for the normal geometric mean anyway, via the AM-GM inequality.