How I divide my data in intervals with almost the same amount of data?

2.8k Views Asked by At

I have data (6 millions of numbers) and I want to divide my data by intervals (range of number). Each interval should have almost the same amount of data.

I know the way for finding intervals with ranges equals between themselves(as here). But I want that each interval defined have the same amount of data

What's the process or formula for finding that?

UPDATE

I need to do a manual calculus. I haven't software as Matlab or SPSS. The 6 millions of number is stored in a table of database. I going to use PL/SQL.

2

There are 2 best solutions below

1
On BEST ANSWER

Sort your numbers (sorting 6 million numbers in something like matlab should take a fraction of a second). Then take the first million, the second million, etc.

3
On

If you want four intervals with approximately the same number of observations in each, then use the sample quartiles as the five 'bin' boundaries. This idea generalizes. For example, if you want 10 bins, use deciles. And so on.

Different software packages use (slightly) different rules for finding quantiles, especially if there are observations with tied values. However, because you have very large samples, and only want bins with 'almost' equal numbers of observations, these differences should not be a problem.

If you are going to use this binning as the basis for making a histogram, you need to be very careful how you interpret the vertical scale of the histogram. It needs to be a 'density' scale, not a 'frequency' scale. The fundamental principle of a histogram is that each observation out of $n$ is represented by $1/n$th of the total area under the histogram.

Here is an example with a million simulated observations from a gamma distribution with shape parameter 10 and rate parameter 0.01 (thus mean 1000). Each of the 10 bars of the resulting histogram represents 100,000 observations. This is not the style of histogram one would use for an ordinary data summary. It is contrived to illustrate your idea of bins with equal counts.

In R statistical software, the hist parameter print=F gives a printout of various characteristics of the histogram, some of which I have shown. (Intensities are used to make bars of the proper heights, so that each of the 10 bars has the same area.)

 n = 10^6; x = rgamma(n, 10, 1/100)
 bin.bdy = quantile(x, seq(1:9)/10)
 hist(x, br=c(min(x), bin.bdy, max(x)), col="wheat")

enter image description here

hist(x, br=c(min(x), bin.bdy, max(x)), plot=F)
$breaks
                10%       20%       30%       40%       50%       60%       70%       80%       90%           
 113.1149  621.9300  728.9403  813.3358  890.6540  966.7731 1047.5072 1138.7230 1251.9731 1420.9409 3292.5520 

$counts
## 100000 100000 100000 100000 100000 100000 100000 100000 100000 100000

$intensities
         10%          20%          30%          40%          50%          60%          70%          80%          90% 
0.0001965351 0.0009344897 0.0011848970 0.0012933562 0.0013137310 0.0012386336     0.0010963020 0.0008830011 0.0005918289 0.0000534299 

Note: Technically, the minimum and the maximum are the 0th and 100th quantiles, respectively. I had to treat them separately here to avoid error messages, which I suppose to be due to tiny errors in floating point arithmetic for this very unusual style of histogram.