how to find two numbers to minimize the sum of squares

174 Views Asked by At

Suppose we are given a sequence of positive numbers $0<a_1<a_2< \cdots < a_n$.

Step 1. Choose an integer $m$ where $m \in \{1,2,\cdots,n\}$. After choosing $m$, we divide our numbers into two subgroups $\{a_1,\cdots,a_m\}$ and $\{a_{m+1},\cdots,a_n\}$.

Step 2. Given two numbers $a', a'' \in \mathbb{R}$, we can calculate the sum of squares in the two subgroups $$SS = \sum_{i=1}^{m} (a_i-a')^2+\sum_{j=m+1}^n (a_j-a'')^2.$$

My question is how to choose $m, a', a''$ to minimize the $SS$.

The $a'$ and $a''$ are easy to solve. In fact, $a'$ should be the mean of the first subgroup, i.e. $\frac{a_1+\cdots + a_m}{m}$ and $a''$ should be the mean of the second subgroup, i.e. $\frac{a_{m+1}+\cdots + a_n}{n-m}$.

To solve $m$, I guess that we can firstly compute the mean $\bar{a}$ of the whole group, i.e. $\frac{a_1+\cdots+a_n}{n}$ and then $m$ should be the largest integer such that $a_m<\bar{a}$. But I don't know how to prove it.

1

There are 1 best solutions below

4
On

The easiest way would likely be to calculate your target sum from step 2 for each $m\in\{1, \dots, n\}$, giving you $m$ sums $S_1, \dots, S_m$. Then just pick the $m$ giving you the smallest such $S_m$.

You can either do this straightforwardly or use an online algorithm to minimize the amount of calculation if this is an issue to you. Online estimation of variance with limited memory gives you an online way of calculating $\sum_{i=1}^m(a_i-a')^2$ for $m=1, \dots, n$ and should be easily modified for the second sum.

Unfortunately, it is not the case that $S_m$ is convex as a function of $m$ (which would allow you to do a faster search), e.g. for $$a=(0.01,0.09,0.48,0.49,0.55,0.6,0.86,0.95,0.97,1).$$

S

R code:

aa <- c(0.01, 0.09, 0.48, 0.49, 0.55, 0.6, 0.86, 0.95, 0.97, 1)
S <- sapply(seq_along(aa),function(ii)
    sum((aa[1:ii]-mean(aa[1:ii]))^2)+sum((aa[-(1:ii)]-mean(aa[-(1:ii)]))^2))
plot(S,type="o",xlab="m",pch=19)

In addition, your hypothesis (calculate $\bar{a}$, pick the largest $m$ such that $a_m<\bar{a}$) does not work, either: use

$$ a = (0.06, 0.26, 0.38, 0.51, 0.61, 0.76, 0.81, 0.94, 0.96, 0.98). $$

Then

$$ \bar{a} = 0.627\quad\text{and}\quad\max\{m|a_m<\bar{a}\}=5, $$

but (rounding to the third digit)

$$ S=(0.559, 0.371, 0.252, 0.214, 0.224, 0.334, 0.441, 0.622, 0.777, 0.916) $$

and

$$ S_4=\min S.$$

(Incidentally, note that in this case $S_m$ is convex, so your hypothesis does not even work in the convex case.)