Let's say that students' heights follow a normal distribution. The heights range from 150 to 190 cm.
What I want to do is to categorize these heights into 10 categories.
This is what I've done:
- Remove outliers
- Sort data by height
- Split the data into 10 equal groups (each group has the same number of students, give or take 1)
- Add the outliers back to the two edge groups.
So, it could be like this:
- Group 1 : 150 ~ 157.2cm
- Group 2 : 157.3 ~ 161.1cm
- Group 3 : 161.2 ~ 163.5cm
- Group 4 : ...
...
- Group 10 : 183.4 ~ 190cm
Finally, I set the height of Group 1 students to 0, the height of Group 2 students to 1, the height of Group 3 students to 2, etc.
Is this a statistically suitable method for categorizing normally distributed data? It seems like a reasonable method to me, as someone who did not major in mathematics.
I need your advice. Thanks.
You may be making this more complicated than it needs to be. Here are 100 observations randomly generated from the distribution $\mathsf{Norm}(\mu =170,\,\sigma=8),$ rounded to integers, and sorted from smallest to largest. (I used R statistical software, but other software would work as well.)
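For anyone who wants to generate a comparable sample, here is a minimal sketch in Python with NumPy (not the R code I used). The seed is an arbitrary assumption, so the values it produces will differ from mine.

```python
import numpy as np

# Arbitrary seed (an assumption), used only to make the sketch reproducible.
rng = np.random.default_rng(2024)

# 100 draws from Norm(mu=170, sigma=8), rounded to integers and sorted.
heights = np.sort(np.rint(rng.normal(loc=170, scale=8, size=100)).astype(int))

print(heights)
```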
Tied values make it impossible to find boundaries that put exactly 10 heights in each interval, but the eleven numbers
$$148.5, 160.5, 163.5, 164.5, 166.5, 168.5, \dots, 179.5, 190.5$$
will come reasonably close. Of course, this is easy to do with $n=100$ subjects, but for other sample sizes you can divide $n$ by $10$ and do about the same thing.
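The "divide $n$ by $10$" idea can be sketched roughly as follows in Python with NumPy. The function name `equal_count_edges` is hypothetical, and the sketch assumes integer-rounded heights and at least as many observations as groups. Each boundary is placed half a unit above the last height of a group, so tied values always land in the same interval.

```python
import numpy as np

def equal_count_edges(heights, k=10):
    """Hypothetical helper: half-integer boundaries for about n/k heights each.

    Assumes integer-rounded heights and len(heights) >= k.
    """
    x = np.sort(np.asarray(heights, dtype=float))
    n = len(x)
    # Last height in each of the first k-1 groups, shifted up by 0.5.
    inner = [x[(i * n) // k - 1] + 0.5 for i in range(1, k)]
    edges = np.array([x[0] - 0.5] + inner + [x[-1] + 0.5])
    # Heavy ties can make two boundaries coincide; drop the duplicates.
    return np.unique(edges)

rng = np.random.default_rng(2024)  # arbitrary seed (an assumption)
heights = np.rint(rng.normal(170, 8, size=100))

edges = equal_count_edges(heights)
counts, _ = np.histogram(heights, bins=edges)
print(edges)
print(counts)  # roughly 10 per interval; ties shift a few counts
```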
Notes: (a) I see no reason to remove the outliers and then add them back later. (b) You didn't explain why you assigned numbers 0, 1, 2, etc. to the intervals. Maybe you have a good reason for this, but I don't see what it might be.
If you want an automated way to find interval boundaries in software, you can find the deciles with software and use those:
When finding quantiles, each software package has its own way of dealing with ties and of handling sample sizes that are not evenly divisible by 10, so you would get slightly different answers from various programs, but the differences would not be important. You might want to make each boundary a non-integer (e.g., 165.5 instead of 165) so you won't have any heights exactly on a boundary.
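As one possible illustration, here is a minimal sketch in Python with NumPy (not the R code I used) that computes the deciles and moves each one to a half-integer. The seed and the simulated data are assumptions for the example. The labels 0 to 9 follow the numbering in the question.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed (an assumption)
heights = np.sort(np.rint(rng.normal(170, 8, size=100)))

# The nine interior deciles (10%, 20%, ..., 90%), NumPy's default method.
deciles = np.quantile(heights, np.arange(1, 10) / 10)

# Shift every boundary to a half-integer so no integer height sits on it.
inner = np.floor(deciles) + 0.5
edges = np.concatenate(([heights.min() - 0.5], inner, [heights.max() + 0.5]))
edges = np.unique(edges)  # heavy ties can make two boundaries coincide

# Category label for each height: 0 for the lowest interval, then 1, 2, ...
labels = np.digitize(heights, edges[1:-1])
counts = np.bincount(labels, minlength=len(edges) - 1)

print(edges)
print(counts)
```

Other quantile definitions are available through the `method` argument of `np.quantile`; they would move a boundary by at most a unit or so here.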
Sometimes, it is desirable for intervals to be of the same length instead of having about the same frequencies in each. You can use software to make a histogram. Most histogram programs make intervals of equal length, unless you force some other choice. Here is a histogram from R. I chose non-integer 'cut points' for the intervals.
The histogram doesn't look very symmetrical, but that's just how my fake data happened to be generated. You can check it against the listing of the data given earlier. It isn't usually a good idea to try making a histogram with equal counts in each 'bin' (interval).
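As a rough illustration of equal-width intervals, here is a minimal sketch in Python with NumPy (not the R code behind the histogram above). The interval width of 5 cm, the seed, and the simulated data are assumptions for the example. It prints a simple text histogram rather than drawing a plot.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed (an assumption)
heights = np.rint(rng.normal(170, 8, size=100))

width = 5  # interval width in cm (an assumption)

# Non-integer cut points: start just below a multiple of `width`
# and continue in equal steps until the largest height is covered.
lo = np.floor(heights.min() / width) * width - 0.5
hi = heights.max() + 0.5
cuts = np.arange(lo, hi + width, width)

counts, edges = np.histogram(heights, bins=cuts)

# One row per interval, one star per observation.
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {'*' * int(c)}")
```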