Suppose I had a data set as follows:
168.95, 176.83, 178.13, 179.61, 179.44, 172.83, 173.37, 174.06, 174.94, 175.43, 175.73, 175.5, 173.78, 174.06, 172.71, 174.3, 178.38, 178.43, 177.18, 175.34, 176.16, 177.03, 178.2, 178.19, 176.42, 177.4, 178.15, 173.1, 173.4, 174.42, 173.87, 171.85, 172.37, 172.73, 172.28, 173.65, 172.35, 171.66, 172.8, 171.97, 168.56, 168.74, 168.38, 169.78, 169.79, 170.07, 171.57, 169.46, 169.32, 170.3, 154.83, 171.33, 169.61, 170.08, 170.11, 170.22, 172.16, 169.64, 169.86, 170.56, 171.64, 166.91, 167.93, 170.35, 170.56, 168.16, 168.41, 170.54, 168.89, 168.03, 169.28, 169.47, 171.24, 169.15, 170.43, 168.36, 169.19, 168.58, 168.67, 169.19, 167.69, 168.4, 168.27, 167.89, 168.18, 168.91, 168.16, 168.98, 168.79, 166.87, 168.36, 170.22, 168.36, 169.01, 168.45, 168.73, 168.24, 170.12, 169.13, 168.95, 169.06, 168.47, 168.68, 170.22, 170.04, 167.92, 168.38, 168.14, 169.27, 170.31, 168.68, 170.59, 168.95, 170.35, 152.99, 153.69, 156.8, 156.45, 156.03, 156.01, 156.63, 156.68, 155.33, 156.31, 155.65, 154.64, 156.02, 155.94, 154.5, 154.53, 154.81, 156.15, 155.58, 155.55, 154.54, 154.66, 155.09, 155.99, 155.27, 155.11, 155.22, 155.98, 156.04, 153.86, 158.48, 158.34, 158.3, 159.29, 158.4, 158.69, 159.17, 159.13, 158.02, 158.7, 157.94, 158.81, 159.14, 159.1
And another data set as follows:
125.14, 130.07, 130.45, 127.8, 128.12, 128.3, 129.18, 128, 127.84, 128.4, 128.86, 128.95, 128.29, 129.24, 128.33, 127.9, 128.96, 127.96, 128.08, 128.96, 128.6, 129.27, 128.74, 129.82, 129.72, 129.65, 130.12, 129.02, 129.8, 129.5, 129.51, 129.97, 129.95, 130.45, 130.51, 130.51, 129.44, 128.42, 129.33, 128.65, 129.15, 129.71, 128.63, 130.17, 128.96, 127.8, 128.12, 128.3, 129.18, 128, 127.84, 128.4, 128.86, 128.95, 128.29, 129.24, 128.33, 127.9, 128.96, 127.96, 128.08, 128.96, 128.6, 129.27, 128.74, 129.82, 129.72, 129.65, 130.12, 129.02, 129.8, 129.5
If you are not allowed to filter or sort the data, since they correspond to real-world, real-time values, what algorithmic method would you suggest for grouping the data into the smallest number of statistically significant groups, one that can be used for both data sets (and possibly more)?
My initial idea was to take a global mean, standard deviation and variance, then split the data into two groups and compute the same statistics locally for each. I would then compare the local values against the global ones, and any group that failed the comparison would be further split in two. This would continue until the standard deviation and variance within each group were small enough not to pose a statistical bias.
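In rough Python, the idea would be something like the sketch below (`sd_threshold` and `min_size` are placeholders for whatever "below par" criterion ends up being used; nothing here is fixed):

```python
import statistics

def recursive_split(values, sd_threshold, min_size=2):
    """Bisect a sequence in its original order (no sorting or filtering)
    until each segment's standard deviation drops below sd_threshold."""
    if len(values) < 2 * min_size or statistics.stdev(values) <= sd_threshold:
        return [values]
    mid = len(values) // 2
    return (recursive_split(values[:mid], sd_threshold, min_size) +
            recursive_split(values[mid:], sd_threshold, min_size))
```

Calling `recursive_split(data, sd_threshold=2.0)` would return a list of contiguous segments, each internally "tight" by that criterion.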
If anyone can think of a method that is more computationally efficient and/or mathematically rigorous, I would appreciate it greatly.
Also, I am new to numerical methods in general and would welcome any advice you have in this endeavor.
The EM (Expectation-Maximisation) algorithm is quite efficient at finding the underlying distributions, assuming they are Normal. The way it works is that you need initial guesses for:
- how many groups of data you might have;
- your guesses for the mean and variance of each group.
All of this information can normally be found by plotting the data. Plotting the data you gave makes it pretty obvious that you have 2 groups. Taking the "middle" of each gives a not too bad estimate of the initial mean for each group, and you can set the variance to 1 for simplicity.
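To make that initial-guess step concrete, a minimal sketch in Python (the starting means of roughly 170 and 156 are simply read off a plot of the first data set; they are eyeballed, not prescribed by the algorithm):

```python
import numpy as np

# First data set from the question (abridged to a few values from each
# visible cluster; use the full list in practice).
data = np.array([168.95, 176.83, 178.13, 170.35, 169.27,
                 152.99, 153.69, 156.80, 155.33, 158.48])

k = 2                               # two groups are visible on a plot
means = np.array([170.0, 156.0])    # rough "middles" read off the plot
variances = np.ones(k)              # variance set to 1 for simplicity
weights = np.full(k, 1.0 / k)       # equal group weights to start
```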
The way the EM algorithm works is that it uses the initial guess to find a better estimate of the distributions given the data, then uses that new estimate to compute the next one, and so on. It can be shown that this process converges and that the distributions found get better and better: the log-likelihood the algorithm calculates at each step never decreases. The algorithm stops when the difference between two consecutive log-likelihood values is smaller than some tolerance, which you have to specify.
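As an illustration, here is a minimal hand-rolled EM loop for the 1-D Normal case, using initial guesses like the ones above (a sketch only; `tol` and `max_iter` are the stopping parameters you must choose, and in practice a library implementation such as `sklearn.mixture.GaussianMixture` does the same job):

```python
import numpy as np
from scipy.stats import norm

def em_1d(data, means, variances, weights, tol=1e-6, max_iter=500):
    """1-D Gaussian-mixture EM; stops once the gain in log-likelihood
    between two consecutive iterations falls below tol."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: how responsible is each component for each point?
        dens = weights * norm.pdf(data[:, None], means, np.sqrt(variances))
        ll = np.log(dens.sum(axis=1)).sum()          # current log-likelihood
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        # from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / len(data)
        means = (resp * data[:, None]).sum(axis=0) / nk
        variances = (resp * (data[:, None] - means) ** 2).sum(axis=0) / nk
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, variances, weights, ll
```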
A few problems:
- If you assign too many groups at the beginning, the algorithm will overfit the data, sometimes producing a zero variance when it tries to fit a distribution to a single point.
- It doesn't automatically determine how many groups you have to begin with.
A way around this is to try every sensible number of groups you can see when plotting the values on a line, so say 1-3 groups here. Fit each, find its log-likelihood, and then use the AIC, AIC = 2p - 2 ln(L) with p the number of free parameters and L the maximised likelihood, to penalise overfitting. The smallest AIC will give you a good estimate of the initial distributions of your data; a sketch of this selection loop follows below.
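Reusing the `em_1d` function from the sketch above, the selection loop could look like this (the evenly spaced starting means are just a crude automatic stand-in for eyeballing the plot):

```python
import numpy as np

def pick_k_by_aic(data, candidate_ks=(1, 2, 3)):
    """Fit a mixture for each candidate number of groups and keep the
    one with the smallest AIC = 2*p - 2*ln(L), where the free parameters
    are (k - 1) weights + k means + k variances, i.e. p = 3k - 1."""
    best = None
    for k in candidate_ks:
        means0 = np.linspace(data.min(), data.max(), k)
        m, v, w, ll = em_1d(data, means0, np.ones(k), np.full(k, 1.0 / k))
        aic = 2 * (3 * k - 1) - 2 * ll
        if best is None or aic < best[0]:
            best = (aic, k, m, v, w)
    return best   # (aic, k, means, variances, weights) of the winner
```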
I have put links to the corresponding Wikipedia pages (on the EM algorithm and on AIC); I find that the animation on the EM page shows very well what it does in a 2D case. The end result of this algorithm is a distribution for each group you have, together with a weight reflecting how much of the data belongs to it.