For a linear regression of $\{(i,y_i)\}_{i=0}^{n-1}$, where $(y_i)$ is increasing and non-negative, is the $y$-intercept at least $-y_{n-1}$?

155 Views Asked by At

Suppose we have a set of data points $\{(i,y_i)\}_{i=0}^{n-1}$, where $y_i$ are non-negative integers and where $(y_i)_{i=0}^{n-1}$ is an increasing sequence.

Question: In a simple linear regression for this data set, is it true that the $y$-intercept at least $-y_{n-1}$?

For example in $\{(i,2^i)\}_{i=0}^{16}$, we have the $y$-intercept as $-10280$ which is greater than $-2^{16}=-65536$.

enter image description here

Bonus question: Given fixed $n$ and $y_{n-1}$, is there a formula for the minimum $y$-intercept over all such sequences?

1

There are 1 best solutions below

1
On BEST ANSWER

Let me replace the property "increasing" by the weaker "non-decreasing". This does not change the problem too much, but it makes the computation less messy.

It is well-known that the best linear function $a+bx$ satisfies the equations $$ \left(\sum_i 1\right)a + \left(\sum_i x_i\right)b = \sum_i y_i \\ \left(\sum_i x_i\right)a + \left(\sum_i x_i^2\right)b = \sum_i x_iy_i. $$ (Differentiate the function $$ f(a,b) = \sum_i(a+bx_i-y_i)^2 = (a,b,1) \cdot\left( \sum_i \begin{pmatrix} 1 \\ x_i \\ -y_i \end{pmatrix} (1, x_i, -y_i) \right) \cdot\begin{pmatrix} 1 \\ a \\ b \end{pmatrix} $$ to see that.)

In our case $x_i=i$, and the equations are $$ n\cdot a + \frac{n(n-1)}2\cdot b = \sum_{i=0}^{n-1} y_i \\ \frac{n(n-1)}2\cdot a + \frac{n(n-1)(2n-1)}6\cdot b = \sum_{i=0}^{n-1} iy_i .$$ By elimianting the slope $b$, the $y$-intercept is $$ a = \frac6{n(n+1)} \sum_{i=0}^n \left(\frac{2n-1}3-i\right) y_i. $$ In the last sum, the coefficient $\frac{2n-1}3-i$ is nonnegative for $i\le \frac{2n-1}3$ and it is negative for $i\ge \frac{2n}3$. In order to obtain a lower bound, omit the nonnegative terms, and apply the upper bound $y_i\le y_{n-1}$ where the coefficient is positive: $$ a \ge \frac6{n(n+1)} \left(\sum_{\frac{2n-1}3 <i\le n-1} \bigg(\frac{2n-1}3-i\bigg)\right) y_{n-1}. \tag{*} $$ This is the lowest possible of $a$ if $n$ and $y_{n-1}$ are fixed. For a precise evaluation we should consider three cases depending on the residue class of $n$ modulo $3$.

In $(*)$ the terms are all negative, they form an arithmetic progression, the first element is $\tfrac13$, $\tfrac23$ or $1$, the last element is $\frac{2n-1}3-(n-1)=-\frac{n-2}3$, so the average is not less than $\frac12\left(-\frac{n-2}3-1\right)=-\frac{n+1}6$. The number of terms is at most $n-\frac{2n}3=\frac{n}3$. Hence, $$ a \ge \frac6{n(n+1)} \left( \frac{n}3 \cdot \frac{-(n+1)}6 \right) y_{n-1} = -\frac13 y_{n-1}. $$

(Hope I did not make too serious miscalculations....)