Group $n$ numbers in $\mathbb{R}$ into $k$ clusters

104 Views Asked by At

Let $x_1\leq x_2\leq...\leq x_n$ be a sequence of real numbers in $\mathbb{R}$. If we want to group the numbers into $k$ groups such that the sum of the within group squared error is minimized, it seems intuitive that the optimal clustering mechanism should be in the form: $$\{x_1,...,x_{j_1}\}, \{x_{j_1+1},...,x_{j_2}\},...,\{x_{j_{k-1}+1},...,x_{n}\}$$ But how can we prove this mathematically with rigor? I've been thinking about this for a while but I believe it is not as easy as it seems to be.

By the sum of within group squared error, I mean, if $G_1,...,G_k$ is a partition of $\{1,...,n\}$, the sume of within group squared error is: $$\sum_{j=1}^k\sum_{i\in G_j}(x_i - \bar{x}_{G_j})^2$$ where $\bar{x}_{G_j}$ is the average of numbers in group $G_j$.

3

There are 3 best solutions below

10
On BEST ANSWER

Show that if a solution is not of this form, it can be improved. That is, prove the contrapositive of "if optimal, then this form."

Edit: I agree that this approach is not so simple to flesh out. This link mentions that the following paper provides a proof: Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798. Indeed, the Appendix contains a proof that uses the same approach that I sketched: assume there is $x_r < x_s < x_t$ with $x_r$ and $x_t$ in one group and $x_s$ in a different group, and show that any such grouping can be strictly improved.

1
On

Here is an experiment. I start with 100 random observations split into four groups.

x1 = runif(25,0,50);  x2 = runif(25,0,50)
x3 = runif(25,0,50);  x4 = runif(25,0,50)
x = c(x1, x2, x3, x4);  g=as.factor(rep(1:4,each=25))

Then the MSE(Within) = MSE(Resid) = 198.61.

anova(lm(x~g))

Analysis of Variance Table

Response: x
           Df  Sum Sq Mean Sq F value Pr(>F)
 g          3   316.7  105.57  0.5316 0.6617
 Residuals 96 19066.4  198.61   

However, if I sort the 100 observations from smallest to largest, the groups become more homogeneous, and MSE(Within) = 16.3, much reduced.

y = sort(x)
anova(lm(y~g))

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
g          3  17817  5939.0  364.07 < 2.2e-16 ***
Residuals 96   1566    16.3                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

I'm not promising that this method gives you the minimum possible MS(Resid). But this the first thing I thought of to reduce MS(Resid).

Notice that the total sum of squares is the same in both cases. It will not change as you rearrange the same observations into groups in different ways: 316.7 + 19066.4 = 19383.1 is the same as 17817 + 1566 = 19383 (except for rounding error when the values are rounded for display).

Also, notice that in both cases $\text{MS(Resid)}$ = $\text{SS(Resid)/DF(Resid)}$ = $\text{SS(Resid)}/96.$ So you can deal with $\text{SS(Resid)},$ if that is more convenient.

Also notice that this is just a math problem without much to do with statistical; practice. If you have four normally distributed groups, combine them, sort them, and then split them into groups by quartiles, you will no longer have normally distributed groups. Here I used groups of 1000 observations each, in order to make nicer histograms. Histograms for the normal groups are in the top row and those for the sorted groups are in the bottom row.

enter image description here

0
On

I expect the average is tripping you up. Here's an idea:


(1) Show that for any $a_1 \leq a_2 \leq \cdots \leq a_k$ (and $a_0 = -\infty, a_{k+1}=\infty$) that $$G_j = \left\{i : \frac{1}{2}(a_{j-1}+a_j) \leq x_i < \frac{1}{2}(a_j+a_{j+1})\right\}$$ minimizes $$\sum_{j=1}^k\sum_{i\in G_j}(x_i-a_j)^2$$ (note: at this step, we don't care if any of the $G_j$ are empty or not)


(2) Prove (or simply note, as it's a standard fact) that if $G_j$ is nonempty, then $$\sum_{i\in G_j}(x_i-\bar{x}_{G_j})^2 \leq \sum_{i\in G_j}(x_i-a_j)^2$$


(3) If you care about making sure the sets in the partition are nonempty, suppose $G_j$ is empty and just take either the max element from $G_{j-1}$ or the minimum element from $G_{j+1}$ and put it in $G_j.$ As long as $k \leq n,$ this shifting around of elements can be done to make sure all sets are nonempty (and is easily seen to never increase the sum we're trying to minimize)


Now, just start with a partition $F_1, F_2, \ldots, F_k$ of $\{1,2,\ldots,n\}$ which minimizes the sum. Set $a_j = \bar{x}_{F_j}$ for $j=1,2,\ldots,k,$ reorder if necessary to make $a_1\leq a_2 \leq \cdots \leq a_k$ and apply all the above steps.