So I have a rather large dataset whose values lie in the interval $[0,1] \subset \mathbb{R}$. The problem is that a large portion of the values are extremely close to $0$.
So firstly, I'm looking for a normalization function that would map those extremely small numbers to more meaningful values, but on the other hand keep all the elements in the initial interval. I have two guiding principles for the envisioned method:
- $a_2 \ge a_1$ where $a_2$ is the new value for $a_1$ after normalization (So we don't want an element's value to decrease after the normalization process).
- $a_1 \ge b_1 \Longrightarrow a_2 \ge b_2$ (meaning if $a$'s value is greater than or equal to $b$'s initially, that should still hold after normalization).
Secondly, I have this more ambitious goal: fixing the average of the dataset to a certain value via some normalizing method.
For instance, if we wanted to set the average to $0.5$ we could simply multiply all elements of the dataset by $\frac{0.5}{\text{initial average}}$; however, that could push some elements out of the interval $[0,1]$, since some values may exceed $1$.
Your help is much appreciated. Please leave a comment if I wasn't clear enough with the description.
As eigenjohnson suggested, taking the logarithm is a reasonable way to deal with numbers of different scales (provided none of the values are exactly $0$). However, you want the numbers to remain in $[0,1]$, and the logarithm will not do that. I suggest raising them to a small power $p>0$. This stretches the neighborhood of $0$: for example, here is $p=0.1$:
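A minimal sketch of this transform (the sample values are made up for illustration): raising to a power $0 < p < 1$ is monotone increasing on $[0,1]$ and satisfies $x^p \ge x$ there, so it meets both of the stated conditions while staying inside the interval.

```python
import numpy as np

# Power transform x -> x**p with a small p (here p = 0.1, as in the example).
# On [0, 1] this is monotone increasing, and for 0 < p < 1 we have x**p >= x,
# so values never decrease, order is preserved, and the range stays [0, 1].
p = 0.1
x = np.array([1e-6, 1e-4, 0.01, 0.5, 1.0])  # hypothetical data, mostly near 0
y = x ** p
print(y)  # the tiny values are stretched toward more "meaningful" magnitudes
```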
There is no nice analytic way to get the mean of transformed values to be $0.5$; you'd have to solve some unpleasant equation for $p$. But it is very easy to set the median to $0.5$. Just find the median $m$ of your data and let $p=\ln(0.5)/\ln(m)$.
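This median recipe can be sketched as follows (the skewed sample data here is just an assumed stand-in for the questioner's dataset): since $x \mapsto x^p$ is monotone, it maps the median $m$ to $m^p$, and choosing $p=\ln(0.5)/\ln(m)$ makes $m^p = 0.5$.

```python
import numpy as np

# Sketch: pick p so the median of the transformed data is 0.5.
# Requires 0 < m < 1; an odd sample size makes the median an actual data point,
# so a monotone transform maps it exactly to m**p.
rng = np.random.default_rng(0)
data = rng.beta(0.1, 2.0, size=10_001)  # hypothetical data, heavily skewed toward 0

m = np.median(data)
p = np.log(0.5) / np.log(m)   # solves m**p = 0.5
transformed = data ** p

print(np.median(transformed))  # numerically 0.5
```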