Suppose I had a data set as follows:
168.95, 176.83, 178.13, 179.61, 179.44, 172.83, 173.37, 174.06, 174.94, 175.43, 175.73, 175.5, 173.78, 174.06, 172.71, 174.3, 178.38, 178.43, 177.18, 175.34, 176.16, 177.03, 178.2, 178.19, 176.42, 177.4, 178.15, 173.1, 173.4, 174.42, 173.87, 171.85, 172.37, 172.73, 172.28, 173.65, 172.35, 171.66, 172.8, 171.97, 168.56, 168.74, 168.38, 169.78, 169.79, 170.07, 171.57, 169.46, 169.32, 170.3, 154.83, 171.33, 169.61, 170.08, 170.11, 170.22, 172.16, 169.64, 169.86, 170.56, 171.64, 166.91, 167.93, 170.35, 170.56, 168.16, 168.41, 170.54, 168.89, 168.03, 169.28, 169.47, 171.24, 169.15, 170.43, 168.36, 169.19, 168.58, 168.67, 169.19, 167.69, 168.4, 168.27, 167.89, 168.18, 168.91, 168.16, 168.98, 168.79, 166.87, 168.36, 170.22, 168.36, 169.01, 168.45, 168.73, 168.24, 170.12, 169.13, 168.95, 169.06, 168.47, 168.68, 170.22, 170.04, 167.92, 168.38, 168.14, 169.27, 170.31, 168.68, 170.59, 168.95, 170.35, 152.99, 153.69, 156.8, 156.45, 156.03, 156.01, 156.63, 156.68, 155.33, 156.31, 155.65, 154.64, 156.02, 155.94, 154.5, 154.53, 154.81, 156.15, 155.58, 155.55, 154.54, 154.66, 155.09, 155.99, 155.27, 155.11, 155.22, 155.98, 156.04, 153.86, 158.48, 158.34, 158.3, 159.29, 158.4, 158.69, 159.17, 159.13, 158.02, 158.7, 157.94, 158.81, 159.14, 159.1
And another data set as follows:
125.14, 130.07, 130.45, 127.8, 128.12, 128.3, 129.18, 128, 127.84, 128.4, 128.86, 128.95, 128.29, 129.24, 128.33, 127.9, 128.96, 127.96, 128.08, 128.96, 128.6, 129.27, 128.74, 129.82, 129.72, 129.65, 130.12, 129.02, 129.8, 129.5, 129.51, 129.97, 129.95, 130.45, 130.51, 130.51, 129.44, 128.42, 129.33, 128.65, 129.15, 129.71, 128.63, 130.17, 128.96, 127.8, 128.12, 128.3, 129.18, 128, 127.84, 128.4, 128.86, 128.95, 128.29, 129.24, 128.33, 127.9, 128.96, 127.96, 128.08, 128.96, 128.6, 129.27, 128.74, 129.82, 129.72, 129.65, 130.12, 129.02, 129.8, 129.5
If you are not allowed to filter or sort the data, since they correspond to real-world, real-time values, what algorithmic method would you suggest for grouping the data into the smallest number of statistically significant groups, one that can be used for both data sets (and possibly more)?
My initial idea was to take a global mean, standard deviation and variance, then split the data into two groups and compute the same statistics locally for each. I would then compare the local values against the global ones, and any group that failed the comparison would be further split in two. This would continue until the standard deviation and variance within each group were small enough not to pose a statistical bias.
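In rough Python, the idea would be something like the sketch below (`sd_threshold` and `min_size` are placeholders for whatever "below par" criterion ends up being used; nothing here is fixed):

```python
import statistics

def recursive_split(values, sd_threshold, min_size=2):
    """Bisect a sequence in its original order (no sorting or filtering)
    until each segment's standard deviation drops below sd_threshold."""
    if len(values) < 2 * min_size or statistics.stdev(values) <= sd_threshold:
        return [values]
    mid = len(values) // 2
    return (recursive_split(values[:mid], sd_threshold, min_size) +
            recursive_split(values[mid:], sd_threshold, min_size))
```

Calling `recursive_split(data, sd_threshold=2.0)` would return a list of contiguous segments, each internally "tight" by that criterion.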
If anyone can think of a method that is more computationally efficient and/or mathematically rigorous, I would appreciate it greatly.
Also, I am new to numerical methods in general and would welcome any advice you have in this endeavor.
The EM (Expectation-Maximisation) algorithm is quite efficient at finding the underlying distributions, assuming they are Normal. The way it works is that you need initial guesses for:
- how many groups of data you might have;
- your guesses for the mean and variance of each group.
All of this information can normally be found by plotting the data. Plotting the data you gave makes it pretty obvious that you have 2 groups. Taking the "middle" of each gives a not too bad estimate of the initial mean for each group, and you can set the variance to 1 for simplicity.
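To make that initial-guess step concrete, a minimal sketch in Python (the starting means of roughly 170 and 156 are simply read off a plot of the first data set; they are eyeballed, not prescribed by the algorithm):

```python
import numpy as np

# First data set from the question (abridged to a few values from each
# visible cluster; use the full list in practice).
data = np.array([168.95, 176.83, 178.13, 170.35, 169.27,
                 152.99, 153.69, 156.80, 155.33, 158.48])

k = 2                               # two groups are visible on a plot
means = np.array([170.0, 156.0])    # rough "middles" read off the plot
variances = np.ones(k)              # variance set to 1 for simplicity
weights = np.full(k, 1.0 / k)       # equal group weights to start
```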
The way the EM algorithm works is that it uses the initial guess to find a better estimate of the distributions given the data, then uses that new estimate to compute the next one, and so on. It can be shown that this process converges and that the distributions found get better and better: the log-likelihood the algorithm calculates at each step never decreases. The algorithm stops when the difference between two consecutive log-likelihood values is smaller than some tolerance, which you have to specify.
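As an illustration, here is a minimal hand-rolled EM loop for the 1-D Normal case, using initial guesses like the ones above (a sketch only; `tol` and `max_iter` are the stopping parameters you must choose, and in practice a library implementation such as `sklearn.mixture.GaussianMixture` does the same job):

```python
import numpy as np
from scipy.stats import norm

def em_1d(data, means, variances, weights, tol=1e-6, max_iter=500):
    """1-D Gaussian-mixture EM; stops once the gain in log-likelihood
    between two consecutive iterations falls below tol."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: how responsible is each component for each point?
        dens = weights * norm.pdf(data[:, None], means, np.sqrt(variances))
        ll = np.log(dens.sum(axis=1)).sum()          # current log-likelihood
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        # from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / len(data)
        means = (resp * data[:, None]).sum(axis=0) / nk
        variances = (resp * (data[:, None] - means) ** 2).sum(axis=0) / nk
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, variances, weights, ll
```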
A few problems:
- If you assign too many groups at the beginning, the algorithm will overfit the data, sometimes producing a zero variance when it tries to fit a distribution to a single point.
- It doesn't automatically determine how many groups you have to begin with.
A way around this is to try every sensible number of groups you can see when plotting the values on a line, so say 1-3 groups here. Fit each, find its log-likelihood, and then use the AIC, AIC = 2p - 2 ln(L) with p the number of free parameters and L the maximised likelihood, to penalise overfitting. The smallest AIC will give you a good estimate of the initial distributions of your data; a sketch of this selection loop follows below.
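Reusing the `em_1d` function from the sketch above, the selection loop could look like this (the evenly spaced starting means are just a crude automatic stand-in for eyeballing the plot):

```python
import numpy as np

def pick_k_by_aic(data, candidate_ks=(1, 2, 3)):
    """Fit a mixture for each candidate number of groups and keep the
    one with the smallest AIC = 2*p - 2*ln(L), where the free parameters
    are (k - 1) weights + k means + k variances, i.e. p = 3k - 1."""
    best = None
    for k in candidate_ks:
        means0 = np.linspace(data.min(), data.max(), k)
        m, v, w, ll = em_1d(data, means0, np.ones(k), np.full(k, 1.0 / k))
        aic = 2 * (3 * k - 1) - 2 * ll
        if best is None or aic < best[0]:
            best = (aic, k, m, v, w)
    return best   # (aic, k, means, variances, weights) of the winner
```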
I have put links to the corresponding Wikipedia pages (on the EM algorithm and on AIC); I find that the animation on the EM page shows very well what it does in a 2D case. The end result of this algorithm is a distribution for each group you have, together with a weight reflecting how much of the data belongs to it.