Best way to categorize normally distributed data?

Let's say that the heights of students follow a normal distribution. The heights range from 150 to 190 cm.

What I want to do is to categorize these heights into 10 categories.

This is what I've done:

  1. Remove outliers
  2. Sort data by height
  3. Split the data into 10 equal-sized groups (each group has the same number of students, give or take 1)
  4. Add the outliers back to the two edge groups.

So, it could be like this:

  • Group 1 : 150 ~ 157.2cm
  • Group 2 : 157.3 ~ 161.1cm
  • Group 3 : 161.2 ~ 163.5cm
  • Group 4 : ...

...

  • Group 10 : 183.4 ~ 190cm

Finally, I replace each height with its group code: Group 1 students get 0, Group 2 students get 1, Group 3 students get 2, and so on.
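The procedure described above can be sketched in R roughly as follows. This is only an illustration under assumptions: the data are simulated, and the names `heights` and `group`, the seed, and the sample size are hypothetical. No separate outlier step is used, because ranking places extreme values in the edge groups anyway.

```r
# Simulated heights (assumption: seed, sample size and parameters are illustrative)
set.seed(1)
heights <- round(rnorm(200, mean = 170, sd = 8), 1)

# Rank students by height; ties.method = "first" breaks ties by position,
# so groups have equal size when n is divisible by 10
n <- length(heights)
r <- rank(heights, ties.method = "first")

# Map ranks 1..n onto group codes 0..9
group <- ceiling(r * 10 / n) - 1

table(group)                    # number of students per group
tapply(heights, group, range)   # smallest and largest height in each group
```

Note that breaking ties by position can put two students of identical height into neighbouring groups; whether that is acceptable depends on what the codes are used for.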

Is this a statistically suitable method for categorizing normally distributed data? It seems like a reasonable method to me, as someone who did not major in mathematics.

I would appreciate your advice. Thanks.

There is 1 best solution below

You may be making this more complicated than it needs to be. Here are 100 observations randomly generated from the distribution $\mathsf{Norm}(\mu =170,\,\sigma=8),$ rounded to integers, and sorted from smallest to largest. (I used R statistical software, but other software would work as well.)

x = sort(round(rnorm(100, 170, 8)))
summary(x)

  Min. 1st Qu. Median    Mean  3rd Qu.    Max. 
149.0   164.0   169.0   169.6   175.0   190.0

x
 [1] 149 150 151 153 153 155 155 160 160 161 161 162 162 162 162 162 163 163 163 163
[21] 163 164 164 164 164 164 164 165 165 165 165 165 165 165 165 166 166 166 166 166
[41] 167 167 168 168 168 168 169 169 169 169 169 170 170 170 170 171 171 171 171 171
[61] 172 172 173 173 173 173 173 173 173 174 174 174 174 175 175 175 176 176 176 176
[81] 177 177 177 177 177 178 179 179 180 181 181 181 182 184 184 184 186 186 188 190

Tied values make it impossible to find boundaries that put exactly 10 heights in each interval, but the eleven numbers

$$148.5, 160.5, 163.5, 164.5, 166.5, 168.5, \dots, 179.5, 190.5$$

will come reasonably close. Of course, this is easy to do with $n=100$ subjects, but for other sample sizes you can divide $n$ by $10$ and do about the same thing.
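For a sample size that is not a multiple of 10, one way to "do about the same thing" might look like the sketch below. It is an assumption-laden illustration, not the exact procedure used above: the sample size 137, the seed, and the names `k`, `inner`, `breaks`, and `counts` are all hypothetical. Each boundary is placed half a unit above the height at position $n/10, 2n/10, \dots$ in the sorted data.

```r
# Simulated integer heights (assumption: n = 137 and the seed are illustrative)
set.seed(1)
n <- 137
x <- sort(round(rnorm(n, mean = 170, sd = 8)))

# Positions of the last observation intended for each of the first nine groups
k <- round(n * (1:9) / 10)

# Put each interior boundary just above the height found at that position
inner <- x[k] + 0.5

# Add outer boundaries; unique() drops duplicates caused by heavy ties
breaks <- unique(c(min(x) - 0.5, inner, max(x) + 0.5))

counts <- table(cut(x, breaks = breaks))
counts   # group sizes are only roughly n/10 because of tied heights
```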

Notes: (a) I see no reason to remove the outliers and then add them back later. (b) You didn't explain why you assigned numbers 0, 1, 2, etc. to the intervals. Maybe you have a good reason for this, but I don't see what it might be.


If you want an automated way to find interval boundaries, you can have software compute the deciles and use those:

quantile(x, seq(0, 1, by = 0.1))
   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
149.0 161.0 163.0 165.0 166.6 169.0 171.4 174.0 176.2 181.0 190.0 

When finding quantiles, each software package has its own way of dealing with ties and of handling sample sizes that are not evenly divisible by 10, so you would get slightly different answers from various programs, but the differences would not be important. You might want to make each boundary a non-integer (e.g., 165.5 instead of 165) so you won't have any heights exactly on a boundary.
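A minimal sketch of turning deciles into non-integer boundaries and then assigning each height to an interval is given below. The data are freshly simulated (the seed is an assumption), so the numbers will not match the listing above, and the names `b`, `breaks`, and `decile_group` are hypothetical.

```r
# Simulated integer heights (assumption: seed chosen only for reproducibility)
set.seed(1)
x <- sort(round(rnorm(100, mean = 170, sd = 8)))

# Deciles using R's default quantile method
b <- quantile(x, probs = seq(0, 1, by = 0.1))

# Move every boundary to a half-integer so no integer height sits on one
breaks <- floor(b) + 0.5
breaks[1] <- min(x) - 0.5                # make sure the minimum is included
breaks[length(breaks)] <- max(x) + 0.5   # make sure the maximum is included
breaks <- unique(breaks)                 # ties can collapse neighbouring deciles

decile_group <- cut(x, breaks = breaks)
table(decile_group)                      # counts are roughly, not exactly, equal
```

If group codes are needed, `as.integer(decile_group) - 1` gives 0, 1, 2, and so on in interval order.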


Sometimes, it is desirable for intervals to be of the same length instead of having about the same frequencies in each. You can use software to make a histogram. Most histogram programs make intervals of equal length, unless you force some other choice. Here is a histogram from R. I chose non-integer 'cut points' for the intervals.

[Figure: histogram of the 100 simulated heights, with equal-width intervals and non-integer cut points]

The histogram doesn't look very symmetrical, but that's just how my fake data happened to be generated. You can check it against the listing of the data given earlier. It isn't usually a good idea to try making a histogram with equal counts in each 'bin' (interval).
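An equal-width grouping of this kind could be sketched as follows. This is not the code that produced the figure; the interval width of 5 cm, the seed, and the names `width`, `breaks`, and `h` are assumptions made for illustration.

```r
# Simulated integer heights (assumption: seed and parameters are illustrative)
set.seed(1)
x <- round(rnorm(100, mean = 170, sd = 8))

# Equal-width cut points at half-integers, starting below the minimum
width <- 5
breaks <- seq(floor(min(x) / width) * width - 0.5, max(x) + width, by = width)

# plot = FALSE returns the counts without drawing; use plot = TRUE for the picture
h <- hist(x, breaks = breaks, plot = FALSE)

data.frame(lower = head(h$breaks, -1),
           upper = tail(h$breaks, -1),
           count = h$counts)
```

With a fixed width the number of intervals depends on the range of the data, so it will not necessarily be 10.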