How to determine the ranges if I want to divide a data set into N segments taking the average/frequency into account?


I have a data set of floating-point numbers such as the following:

[0.01053,
 0.00444,
 0.00957,
 0.04564,
 0.00709,
 0.01338,
 0.02857,
 0.02593,
 0.01056,
 0.05366,
 0.02252,
 0.0237,
 0.01288,
 0.02905,
 0.0119,
 0.04911,
 0.01761,
 0.02105,
 0.01859,
 0.05769,
 0.00576,
 0.01736,
 0.00948,
 0.01465,
 0.032,
 0.00429,
 0.10266,
 0.01794,
 0.01794,
 0.00993,
 0.01415,
 0.00866,
 0.02613,
 0.03759,
 0.02885,
 0.01556,
 0.00881,
 0.01408,
 0.01544,
 0.04186,
 0.00336]

The average for this data set is: 0.02244

The number of sections I need is: 3

I need to create 3 equal sections starting from 0 that take the average into account. In other words, if some numbers in the set are very large but most of them are small, then I want to build the segments around the average and ignore the outlying numbers.

The first range must also start from 0.

Currently I thought I could use the following formula:

segment width = 2 × average / number of segments

This would give me the following segments:

range1: 0..0.01496
range2: 0.01496..0.02992
range3: 0.02992..0.04488

One of the issues I see is that I need the first range to start from 0 but the dataset may not contain any zeros. So I need to somehow shift the numbers to the left.

However, I'm not very strong in math and I would really appreciate some guidance to see if this is correct.

Also, I'm not sure if I've stated the question and description clearly enough, so please comment for clarification and I'll edit the question appropriately.

There are 2 best solutions below

0
On BEST ANSWER

In the end I used some of what @plumSemPy said about using the median and not the average.

I sort the data to find the median. I also shift the data by subtracting the smallest value from each number so the set starts at 0.

Finally I use the following formulas to get the ranges:

width        = 2*median/(number of ranges)
first range  = width*0 ... width*1
second range = width*1 ... width*2
third range  = width*2 ... width*3
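
The steps above can be sketched in Python roughly as follows. This is only an illustration: it assumes the median is taken after the shift (the answer does not say which order was used), and the function name `median_ranges` and the short sample list are made up for the example.

```python
from statistics import median


def median_ranges(values, n_ranges=3):
    """Return n_ranges equal-width (low, high) pairs covering 0 .. 2*median."""
    smallest = min(values)
    # Shift the data so that the smallest value becomes 0.
    shifted = [v - smallest for v in values]
    # Assumption: the median is computed on the shifted data.
    width = 2 * median(shifted) / n_ranges
    return [(i * width, (i + 1) * width) for i in range(n_ranges)]


# First seven values of the question's data set, used only as sample input.
sample = [0.01053, 0.00444, 0.00957, 0.04564, 0.00709, 0.01338, 0.02857]
ranges = median_ranges(sample)
for low, high in ranges:
    print(f"{low:.5f} .. {high:.5f}")
```

Shifted values larger than twice the median fall above the last range, which is how the outliers end up being ignored.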
0
On

The simple approach is to make the bottom range include the bottom third of the points, the next range the middle third of the points, and the top range the top third. Another approach would be to make the middle range some number of standard deviations around the mean. Mean $\pm 0.43$ standard deviations should capture about a third of the values if the data is normally distributed (it probably isn't). There are other sensible approaches, but if you don't like these, the reasons for rejecting them may lead you to the one you like.
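
Both suggestions can be tried quickly in Python. The sketch below is a hedged illustration, not a definitive implementation: the helper names `tertile_cuts` and `mean_sd_cuts` and the sample list are invented for the example, and `statistics.quantiles` requires Python 3.8 or later.

```python
from statistics import mean, quantiles, stdev


def tertile_cuts(values):
    """Two cut points that leave about a third of the points in each range."""
    return quantiles(values, n=3)  # n=3 yields n-1 = 2 cut points


def mean_sd_cuts(values, k=0.43):
    """Bounds of the middle range: mean - k*sd and mean + k*sd."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return m - k * s, m + k * s


# First seven values of the question's data set, used only as sample input.
sample = [0.01053, 0.00444, 0.00957, 0.04564, 0.00709, 0.01338, 0.02857]
low_cut, high_cut = tertile_cuts(sample)
mid_low, mid_high = mean_sd_cuts(sample)
print(f"tertile ranges: 0 .. {low_cut:.5f} .. {high_cut:.5f} .. {max(sample):.5f}")
print(f"middle range from mean and sd: {mid_low:.5f} .. {mid_high:.5f}")
```

With the tertile cuts, the three ranges have unequal widths but roughly equal counts; with the mean-based cuts, the counts are only close to a third each when the data is near normal.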