Normalizing data sets where some sets may contain incredibly large numbers.

Question

944 Views Asked by Bumbble Comm At 04 Apr 2026 - 4:26

The title for this one was a bit difficult to word, so let me explain in a bit more detail (I'm not a mathematician, so please pardon any wrong terminology I use -- corrections are perfectly welcome, though):

I have a few hundred data sets that look something like this one:

[0:76, 1:24, 2:44, 3:15, 4:66, 5:89, 6:40, 7:102, 8:12, ...]

As you can see, I have indices, which are plotted on the X-axis, and values, which are plotted on the Y-axis.

I would like to normalize all of this data between values of 0 and 1, so I decided to use the feature scaling formula:

$$X' = a + \frac{(X - X_{min})(b - a)}{(X_{max} - X_{min})}$$

Ref: https://en.wikipedia.org/wiki/Normalization_(statistics)
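For concreteness, here is a minimal Python sketch of that formula applied to the first data set above. The function name `rescale` and the way a constant data set is handled are illustrative assumptions, not something the formula itself prescribes.

```python
def rescale(values, a=0.0, b=1.0):
    """Min-max scale values into [a, b]:
    X' = a + (X - X_min)(b - a) / (X_max - X_min)
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if every value is identical the formula divides by zero,
        # so map everything to the lower bound a.
        return [a for _ in values]
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]


# Y-values of the first example data set (indices 0..8).
small = [76, 24, 44, 15, 66, 89, 40, 102, 12]
print(rescale(small))  # the smallest value maps to 0.0, the largest to 1.0
```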

One problem, though, is that some of the data sets may have incredibly large values, when compared to the rest of the sets:

[0:12000, 1:8909, 2:9045, 3:1001, 4:289, 5:6784, 6:3400, 7:1899, 8:1023, ...]

So, when all of these are plotted after normalization, most of the data appears to sit near zero, and only the large-value sets are visible. While this is technically correct and the scaling was done properly, I would like to set some sort of limit so that any value exceeding it is clamped to that limit. Values at or above the limit would then be plotted as 1, while the remaining values would fall below that but still be visible.

I was toying around with how to get this done, and was thinking of perhaps using the mean as the limit, but the mean is highly influenced by these few large-value data sets. So I'm having some difficulty determining the limit.

Edit:

Clarification on my notation in the above data sets: X:Y -> X is the X-axis value (point in time), Y is the Y-value (value at that point in time).

There is 1 best solution below

**Bumbble Comm** · Answer 1 · 2018-03-20 01:20:11

If you just want a cutoff, you could use the $75^{th}$ percentile or something like that. You might prefer $1.5$ times the $75^{th}$ percentile, so that nothing gets clipped at the top in cases where you don't have a few large values.
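A rough sketch of the percentile cutoff in Python, using only the standard library. It rests on several assumptions: the percentile is taken over all values pooled across the data sets, the values are non-negative, and the names `clipped_scale` and `factor` are made up for illustration. Values at or above the limit come out as 1.

```python
from statistics import quantiles


def clipped_scale(datasets, factor=1.5):
    """Clamp values at factor * (75th percentile), then scale to [0, 1]."""
    pooled = [y for ds in datasets for y in ds]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    q3 = quantiles(pooled, n=4, method="inclusive")[2]
    limit = factor * q3
    lo = min(pooled)
    if limit <= lo:
        # Degenerate case (e.g. all values are zero): nothing sensible to scale.
        return [[0.0 for _ in ds] for ds in datasets]
    # min(y, limit) clamps large values, so they end up plotted as exactly 1.
    return [[(min(y, limit) - lo) / (limit - lo) for y in ds] for ds in datasets]


small = [76, 24, 44, 15, 66, 89, 40, 102, 12]
large = [12000, 8909, 9045, 1001, 289, 6784, 3400, 1899, 1023]
for scaled in clipped_scale([small, large]):
    print([round(v, 3) for v in scaled])
```

With only the two example sets from the question, half of the pooled values are large, so the limit still lands well above the small set. With a few hundred sets that are mostly small, the pooled percentile would presumably fall much closer to the small values; setting `factor=1.0` gives the plain $75^{th}$ percentile cutoff.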

As saulspatz mentioned, you can take the log of the $y$ values to bring the large ones more in line.
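One possible way to do this, sketched in Python: take the base-10 log of every $y$ value first and then apply the same min-max scaling over all sets. This assumes every value is strictly positive (the log of zero or a negative number is undefined); the choice of base 10 and the name `log_scale` are illustrative.

```python
import math


def log_scale(datasets):
    """Log-transform all values, then min-max scale them jointly to [0, 1]."""
    # Assumption: every y is > 0, otherwise math.log10 raises ValueError.
    logged = [[math.log10(y) for y in ds] for ds in datasets]
    lo = min(min(ds) for ds in logged)
    hi = max(max(ds) for ds in logged)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [[0.0 for _ in ds] for ds in logged]
    return [[(v - lo) / (hi - lo) for v in ds] for ds in logged]


small = [76, 24, 44, 15, 66, 89, 40, 102, 12]
large = [12000, 8909, 9045, 1001, 289, 6784, 3400, 1899, 1023]
for scaled in log_scale([small, large]):
    print([round(v, 3) for v in scaled])
```

On the two example sets, the largest value of the small set (102) ends up at roughly 0.31 instead of sitting close to zero, at the cost of the plot no longer being linear in $y$.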

Another approach is to plot two graphs, one incorporating the small values and one incorporating the large ones. Again, the cutoff could be based on a percentile.
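As a sketch of how the sets might be divided between the two graphs, the Python below groups whole data sets by their peak value, using the $75^{th}$ percentile of the peaks as the cutoff. Splitting by peak is only one possible reading of "small" and "large", and `split_by_peak` is a hypothetical helper name; it needs at least two data sets.

```python
from statistics import quantiles


def split_by_peak(datasets):
    """Partition data sets into (small, large) groups by their maximum value."""
    peaks = [max(ds) for ds in datasets]
    # 75th percentile of the per-set peaks; quantiles() needs at least two peaks.
    cutoff = quantiles(peaks, n=4, method="inclusive")[2]
    small_group = [ds for ds, p in zip(datasets, peaks) if p <= cutoff]
    large_group = [ds for ds, p in zip(datasets, peaks) if p > cutoff]
    return small_group, large_group


small = [76, 24, 44, 15, 66, 89, 40, 102, 12]
large = [12000, 8909, 9045, 1001, 289, 6784, 3400, 1899, 1023]
small_group, large_group = split_by_peak([small, large])
print(len(small_group), len(large_group))
```

Each group can then be normalized and plotted on its own graph, so the small sets are scaled against their own range rather than against the large ones.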

Maybe the important thing in your data is whether a data set is large or small, and the variation among the small ones is not important. In that case, the scaling you are already doing is fine.

You need to decide what is the important thing to show.
