I was casually experimenting with Desmos's linear regression and noticed some patterns in the best fit line for points which fit y = x^2.
For (0,0) and (1,1) the best fit line is y = 1x+0 and r = 1 (the correlation coefficient).
For (0,0) , (1,1) , (2,4) , ... , (10,100) the best fit line is y = 10x-15 and r = .963.
And for (0,0) , (1,1) , (2,4) , ... , (20,400) the best fit line is y = 20x-63.333 and r = .965.
My questions are:
- Why is the best fit slope always equal to the # of points - 1?
- Does the y-intercept follow the curve $-\frac{1}{6}x^2 + \frac{1}{2}x - \frac{1}{3}$ where x is the # of points?
- Does the correlation coefficient approach $1$, and why does it even increase as more points are added?
What is the formula for the line of best fit here?
In this case, because the points we take follow an algebraic pattern, we can calculate a linear regression on the points $\{(0,0), (1,1), (2,4), \dots, (n,n^2)\}$ in terms of $n$, and see what happens.
To do this, we start with the overdetermined system of equations saying that a line $y=ax+b$ passes through all $n+1$ points: $0 = a\cdot0+b$, $1=a\cdot1+b$, and so on through $n^2 = a\cdot n+b$. In matrix form, with the unknowns ordered as $b$ then $a$ to match the columns, $$\begin{bmatrix}1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & n\end{bmatrix}\begin{bmatrix}b \\ a\end{bmatrix} = \begin{bmatrix}0^2 \\ 1^2 \\ 2^2 \\ \vdots \\ n^2\end{bmatrix}.$$ In general there is no such line; to find the least-squares solution, we multiply both sides by the transpose of the coefficient matrix: $$\begin{bmatrix}1 & 1 & 1 & \cdots & 1\\ 0 & 1 & 2 & \cdots & n\end{bmatrix}\begin{bmatrix}1 & 0 \\ 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & n\end{bmatrix}\begin{bmatrix}b \\ a\end{bmatrix} = \begin{bmatrix}1 & 1 & 1 & \cdots & 1\\ 0 & 1 & 2 & \cdots & n\end{bmatrix}\begin{bmatrix}0^2 \\ 1^2 \\ 2^2 \\ \vdots \\ n^2\end{bmatrix}.$$ This simplifies to a $2\times 2$ system $$\begin{bmatrix}n+1 & \frac{n(n+1)}{2} \\ \frac{n(n+1)}{2} & \frac{n(n+1)(2n+1)}{6}\end{bmatrix} \begin{bmatrix}b \\ a\end{bmatrix} = \begin{bmatrix}\frac{n(n+1)(2n+1)}{6} \\ \frac{n^2(n+1)^2}{4}\end{bmatrix}$$ where the increasingly complicated formulas in the cells are the sum of our $x$-coordinates, the sum of their squares, and the sum of their cubes. Solving this system gets us a general formula for $a$ and $b$, namely $a = n$ and $b = -\frac{n(n-1)}{6}$: the line of best fit is $$y = nx -\frac{n(n-1)}{6}.$$
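(This does fit the formula conjectured in the question: substituting $x = n+1$, the number of points, into $-\frac16 x^2 + \frac12 x - \frac13$ gives exactly $-\frac{n(n-1)}{6}$.)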
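As a sanity check on the closed form $y = nx - \frac{n(n-1)}{6}$, here is a minimal sketch that recomputes the least-squares line with exact rational arithmetic. The function name `best_fit_line` is made up for this example, and it uses the standard textbook formulas for slope and intercept rather than the matrix form above.

```python
from fractions import Fraction

def best_fit_line(n):
    """Least-squares slope and intercept for the points (i, i^2), i = 0..n.

    Hypothetical helper for illustration; requires n >= 1 so that the
    x-values are not all equal.
    """
    xs = [Fraction(i) for i in range(n + 1)]
    ys = [x * x for x in xs]
    count = len(xs)  # n + 1 points
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (count * sxy - sx * sy) / (count * sxx - sx * sx)
    intercept = (sy - slope * sx) / count
    return slope, intercept

# Compare against the closed form a = n, b = -n(n-1)/6.
for n in (1, 10, 20):
    slope, intercept = best_fit_line(n)
    print(n, slope, intercept, Fraction(-n * (n - 1), 6))
```

For $n = 10$ and $n = 20$ this should reproduce the lines from the question, $y = 10x - 15$ and $y = 20x - 63.333\ldots$.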
Why?
It's hard to answer "why" questions when the real "why" is "because the formula said so".
It is reasonable to expect a slope approximately equal to $n$, because this is the slope from the first to the last point: $\frac{n^2-0}{n-0}$. This doesn't tell us that the slope must be exactly $n$, but sometimes math does nice things for us.
We can also get some insight into the problem by rescaling. Divide all $x$-coordinates by $n$ and all $y$-coordinates by $n^2$. Our $n+1$ points are now $(0,0), (\frac1n, \frac1{n^2}), (\frac2n, \frac4{n^2}), \dots, (1,1)$: they are $n+1$ evenly-spaced points on the graph of $y=x^2$ for $x$ between $0$ and $1$. As $n$ increases, our line of best fit should get closer and closer to some line best approximating that graph...
...and it does; rescaling our line of best fit gives us $y = x - \frac16 + \frac1{6n}$, and this approaches $y = x - \frac16$ as $n$ increases.
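As a cross-check on that limit, here is a sketch of the continuous version of the problem. It assumes that "best approximating" means minimizing the integrated squared error over $0 \le x \le 1$, so that the sums in the regression become integrals; that reading is my assumption, not something derived above. We minimize $$E(a,b) = \int_0^1 \left(x^2 - ax - b\right)^2\,dx.$$ Setting $\partial E/\partial b = 0$ and $\partial E/\partial a = 0$ gives the two equations $$b + \frac a2 = \frac13, \qquad \frac b2 + \frac a3 = \frac14,$$ whose solution is $a = 1$, $b = -\frac16$. That is the same line $y = x - \frac16$ that the rescaled line of best fit approaches.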
What about the correlation coefficient?
The most convenient formula to use here is $$r_{xy} = \frac{N\sum x_i y_i - \sum x_i\sum y_i} {\sqrt{N\sum x_i^2-\left(\sum x_i\right)^2}~\sqrt{N\sum y_i^2-\left(\sum y_i\right)^2}},$$ where $N = n+1$ is the number of points (for large $n$ we can treat $N$ as approximately $n$). Here, each sum is one of the familiar expressions $ \frac{n(n+1)}{2}$, $ \frac{n(n+1)(2n+1)}{6}$, or $\frac{n^2(n+1)^2}{4}$, except that $\sum y_i^2 = \sum_{i=0}^n i^4$, which has a more complicated formula I'm scared to write down.
To avoid getting into those weeds, we approximate $\sum x_i \approx n^2/2$, $\sum x_i^2 = \sum y_i \approx n^3/3$, $\sum x_iy_i \approx n^4/4$, and $\sum y_i^2 \approx n^5/5$. (The general rule for the approximation is that when we have a sum of $r^{\text{th}}$ powers, $\sum_{i=0}^n i^r$, it is approximately $n^{r+1}/(r+1)$. One way to get there is to replace the sum by an integral.)
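To see how good the rule $\sum_{i=0}^n i^r \approx n^{r+1}/(r+1)$ is, here is a small sketch that prints the ratio of the exact sum to the estimate. The helper names `power_sum` and `integral_estimate` are invented for this example.

```python
def power_sum(n, r):
    """Exact value of 0^r + 1^r + ... + n^r (for r >= 1)."""
    return sum(i ** r for i in range(n + 1))

def integral_estimate(n, r):
    """Integral of x^r from 0 to n, used as the approximation."""
    return n ** (r + 1) / (r + 1)

# Ratio of exact sum to estimate; it should drift toward 1 as n grows.
for r in (1, 2, 3, 4):
    ratios = [power_sum(n, r) / integral_estimate(n, r) for n in (10, 100, 1000)]
    print(r, [round(q, 4) for q in ratios])
```

The ratios are noticeably above 1 for small $n$ (for example, $55/50 = 1.1$ when $r = 1$ and $n = 10$), which is why the approximation only captures the large-$n$ behavior.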
This approximation simplifies the numerator to about $n^5/12$, the first square root to $\sqrt{n^4/12}$, and the second square root to $\sqrt{4n^6/45}$. Overall, the factors of $n$ cancel, and we get $$r_{xy} \approx \frac{\sqrt{15}}{4} \approx 0.968246.$$ Because we made these approximations, this only gets us the limiting behavior: if you like, the correlation coefficient for $y = x^2$ when $0 \le x \le 1$, with line of best fit $y = x - \frac16$. For any finite $n$, the correlation coefficient is only close to this value, and it approaches it as $n$ grows. (That may mean it sometimes increases as we increase $n$ - but it won't increase all the way to $1$.)
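The limit can also be checked numerically without the approximations. The sketch below computes $r$ directly from exact integer sums for the $n+1$ points $(i, i^2)$ and compares it with $\sqrt{15}/4$; the function name `correlation` is made up for this example.

```python
from math import sqrt

def correlation(n):
    """Pearson r for the points (i, i^2), i = 0..n (requires n >= 1)."""
    xs = range(n + 1)
    ys = [x * x for x in xs]
    count = n + 1  # number of points
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = count * sxy - sx * sy
    denominator = sqrt(count * sxx - sx * sx) * sqrt(count * syy - sy * sy)
    return numerator / denominator

limit = sqrt(15) / 4
# Print n, r, and the gap to the limiting value.
for n in (1, 2, 5, 10, 20, 100, 1000):
    r = correlation(n)
    print(n, round(r, 6), round(limit - r, 6))
```

With two points ($n = 1$) the correlation is exactly $1$. For $n = 10$ and $n = 20$ the output should agree with the values $.963$ and $.965$ from the question, both just below $\sqrt{15}/4$.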