What is the "computer's method" for calculating beta for a regression line?


Since this is not from a problem set but just something I've been thinking, please forgive me for possible (wording) errors.

Consider a set of data points $(x_n,b_n)$. We would like to know the equation of the regression line for these points.

To obtain the equation of the regression line, we can substitute the points $(x_n,b_n)$ into a linear system $Ax=b$.

For example, for the points $(1,2)$ and $(3,5)$, we know $1\cdot\beta+t=2$ and $3\cdot\beta+t=5$, which can be expressed as $$ \left[ \begin{array}{cc|c} 1&1&2\\ 3&1&5 \end{array} \right] $$

From here, we may obtain the beta value by solving the normal equations $A^TA x^*=A^T b$.
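As a concrete sketch of this approach with NumPy (the variable names are mine), assuming the model $y = \beta x + t$:

```python
import numpy as np

# Design matrix and right-hand side for the points (1, 2) and (3, 5):
# each row of A is [x_i, 1], so A @ [beta, t] should match b.
A = np.array([[1.0, 1.0],
              [3.0, 1.0]])
b = np.array([2.0, 5.0])

# Normal equations: (A^T A) x* = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)
beta, t = x
print(beta, t)  # beta = 1.5, t = 0.5: the line passes through both points exactly
```

With only two points the system is square and the "regression" line interpolates them, but the same code works unchanged for more points.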

The question has now become, does the computer adopt this method to compute beta?

Considering every multiplication as an operation, this method seems to take a lot of operations (in computing $A^TA$ and $A^T b$, for example). How can this method be efficient at all?

What I am looking for is either a justification of the efficiency of this method (implying that this is indeed the method the computer uses to compute beta), or an alternative method that the computer employs, together with a justification of why that alternative is more efficient than the one illustrated here.

5 Answers

Best Answer (score 4)

The calculation "using matrices" ends up exactly the same as what user1892304 obtained. If the data points are $(x_i, y_i)$, the matrix equations are $$ A^T A \pmatrix{\beta_1 \cr \beta_2} = A^T B $$ where $$ A = \pmatrix{x_1 & 1\cr x_2 & 1\cr \ldots & \ldots\cr x_n & 1\cr},\ B = \pmatrix{y_1\cr y_2\cr \ldots\cr y_n\cr}$$ so that $$ A^T A = \pmatrix{ \sum_i x_i^2 & \sum_i x_i\cr \sum_i x_i & n\cr},\ A^T B = \pmatrix{\sum_i x_i y_i\cr \sum_i y_i\cr} $$

Completely blind matrix calculation, not taking advantage of the facts that $A^T A$ is symmetric (so the $(2,1)$ entry is the same as the $(1,2)$ entry) and that the second column of $A$ is all $1$'s, will do some unnecessary arithmetic, but it's still $O(n)$, and many modern linear algebra codes are highly optimized for today's processors, so these computations will be blindingly fast. Then the final solution of a $2 \times 2$ system is very easy.
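Since the entries of $A^TA$ and $A^TB$ above are just running sums, the whole $O(n)$ computation can be sketched as follows (function and variable names are my own):

```python
import numpy as np

def fit_line_normal_eqs(x, y):
    """Fit y = beta1*x + beta2 by building and solving the 2x2
    normal equations from running sums -- O(n) work overall."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x * x).sum(), (x * y).sum()
    # A^T A and A^T B exactly as written above
    ATA = np.array([[sxx, sx],
                    [sx,  n]])
    ATB = np.array([sxy, sy])
    beta1, beta2 = np.linalg.solve(ATA, ATB)
    return beta1, beta2  # slope, intercept

print(fit_line_normal_eqs([1, 3], [2, 5]))  # (1.5, 0.5), matching the question's example
```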

Answer (score 0)

The least-squares linear regression has the following closed form, which you can easily derive by taking the partial derivatives of the sum of squared residuals with respect to $\beta_1$ and $\beta_2$ and setting each to zero:

$$\beta_1 = \frac{n\Sigma(x_nb_n) - \Sigma(x_n)\Sigma(b_n)}{n\Sigma(x_n^2) - \Sigma(x_n)^2}\\[1em] \beta_2 = \frac{\Sigma(x_n)\Sigma(x_nb_n) - \Sigma(x_n^2)\Sigma(b_n)}{\Sigma(x_n)^2 -n\Sigma(x_n^2)}. $$

such that $y=\beta_1x+\beta_2$, where $\Sigma(A_n)=A_1+A_2+\cdots +A_n$ (assuming we have a dataset of $n$ points). This makes the computation more efficient than computing the coefficients using matrices.

Edit: disregard the last sentence; Robert Israel has pointed out that both computations turn out to be $\text{O}(n)$.
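A direct transcription of this closed form, with the shorthand sums expanded (the helper name is mine):

```python
def fit_line_closed_form(xs, ys):
    """Slope and intercept from the closed-form sums -- no matrices at all."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    beta2 = (sx * sxy - sxx * sy) / (sx * sx - n * sxx)
    return beta1, beta2

print(fit_line_closed_form([1, 3], [2, 5]))  # (1.5, 0.5)
```

Note that this computes the same five sums as the matrix formulation, which is why both end up $O(n)$.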

Answer (score 7)

Well, the answer is sort of orthogonal to the math problem. It depends on what type of computer you're talking about. At the extremes there are two types of computers: a simple computer with a single processor, and a supercomputer with a floating-point vector processor containing thousands of processing units. A simple computer can't really do matrix operations in a single operation cycle; it simulates such a mathematical operation over many cycles. A supercomputer with a floating-point array processor could break the problem into parts so that many processors compute in each cycle. For a linear regression line, I doubt that such a supercomputer could do the calculation in much less time. However, if the problem had, say, six degrees of freedom, then a supercomputer would be much faster, since it can really do matrix mathematics.


There is another aspect to "efficiency": efficiency for whom? APL is a programming language that allows matrix operations; the programmer doesn't know or care how the matrix operations are "really" calculated. Obviously the calculations need to finish in a "reasonable time", so a floating-point array processor would have the computer and the human working at the "same level": it could do more work per cycle.

The problem with just linking any old processors together to make a supercomputer is that you have to coordinate all the processing. Consider the IBM supercomputer that played Jeopardy: it was not just a big floating-point array processor. The program inputs were sliced and diced into multiple tasks, and each task was in turn split multiple ways, so there was a lot of "tuning" of the thousands of algorithms to make them work together.

The gist is that just because you have a supercomputer doesn't mean that any old problem can be solved much faster. There is a lot of art in breaking an arbitrary problem into algorithms and then optimizing those algorithms.

Answer (score 0)

For large data sets and large models, especially when the matrices $A$ are sparse (containing a very small fraction of non-zero values), a popular choice is the Krylov subspace family of algorithms. These include conjugate gradient, generalized minimal residual (GMRES), Arnoldi iteration, Lanczos, and the various types of preconditioning that can be applied to them.

These are built on the principle that the subspaces spanned by $\{ v,Av,A^2v,\cdots\}$ can be used to approximate $A^{-1}v$ in various ways. The nice thing here is that we calculate the vectors spanning the space as a sequence of matrix-vector multiplications, need to store only a couple of vectors (no extra matrices), and so on. These methods are also quite friendly to parallelization, which is increasingly important as modern office machines usually contain multi-core CPUs and GPUs with thousands of cores.
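A minimal sketch of the conjugate gradient idea (my own simplified implementation, for a symmetric positive-definite system, not production code):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Plain conjugate gradient for a symmetric positive-definite A.
    Touches A only via matrix-vector products A @ p, so A may be
    sparse or even an implicit operator."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # new direction, A-conjugate to the old ones
        rs = rs_new
    return x

# Example: the 2x2 SPD system from the regression example in the question
ATA = np.array([[10.0, 4.0], [4.0, 2.0]])
ATb = np.array([17.0, 7.0])
print(conjugate_gradient(ATA, ATb))  # ~ [1.5, 0.5]
```

In exact arithmetic CG terminates in at most $n$ iterations; in practice it is stopped far earlier, once the residual is small enough.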

Answer (score 3)

When $A$ is a generic dense matrix, the linear least-squares problem $\min_x \|Ax - b\|_2$ is never, ever computed numerically by solving the symmetric system $A^TA x = A^T b$. "Squaring" the matrix through the product $A^T A$ roughly squares its condition number ($\kappa(A^TA) = \kappa(A)^2$), which leads to greater numerical instability.

Instead, given an $m \times n$ matrix $A$, we compute the thin QR factorization $A = QR$, where $Q$ is $m \times n$ with orthonormal columns and $R$ is $n \times n$ upper triangular. This factorization is typically computed by applying a sequence of Householder reflections. For the case $n = 2$ (the 1D linear regression case), this factorization costs approximately $8m$ flops.

After the QR factorization is computed, we solve the $2 \times 2$ linear system $Rx = Q^T b$. Forming this system costs approximately $3m$ flops, and solving the $2 \times 2$ system takes constant time in $m$. This leads to a full linear least-squares solve of approximately $11m$ flops.
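The QR route described here can be sketched in NumPy (the sample data is made up for illustration):

```python
import numpy as np

# Least squares via thin QR, as described: A = QR, then solve R x = Q^T b.
x_data = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

A = np.column_stack([x_data, np.ones_like(x_data)])  # m x 2 design matrix
Q, R = np.linalg.qr(A, mode='reduced')               # Q: m x 2, R: 2 x 2
x = np.linalg.solve(R, Q.T @ y)                      # [slope, intercept]
print(x)  # ~ [1.94, 1.09]
```

This gives the same least-squares solution as a library call such as `np.linalg.lstsq(A, y)`, which delegates to an equally stable LAPACK factorization under the hood.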