Reasoning behind choosing an appropriate decomposition when solving linear equations.


Now I am reading the book "Introduction to Linear Algebra" by Gilbert Strang. I understand all the technical details regarding the LU, QR and SVD decompositions, but I get completely confused when it comes to choosing between them in a particular case.

I also know the following time complexities:

  1. LU decomposition: $\mathcal O\left(n^3\right)$ for square matrix of size $n \times n$;

  2. QR decomposition: $\mathcal O\left(mn^2\right)$ for rectangular matrix of size $m \times n$;

  3. SVD decomposition: $\mathcal O\left(n^3\right)$ for square matrix of size $n \times n$.

Also, the time complexity of solving $Ax = b$ as $x = A^{-1}b$ is $\mathcal O\left(n^2\right)$ if we already know $A^{-1}$. But if you already have the matrices $L$ and $U$, the time complexity of solving $LUx = b$ is the same, since it amounts to two triangular solves. So, as far as I know, in this case the LU decomposition is only considered better for numerical reasons. The QR decomposition doesn't speed up the process either.

The QR decomposition can simplify the calculation of the least-squares solution, since the equation becomes $\hat x = R^{-1}Q^{\mathrm T}b$. But taking into account that you first need to compute the QR decomposition itself and only then apply it to least squares, it doesn't seem to speed anything up.

I also know it's a good idea to use LU or QR, when you always have the same matrix $A$ and different $b$'s. But not considering this use case, are all these decompositions used only for numerical stability? Because I don't actually see how they can speed up calculations.

But it seemed to me those are just little pieces of a big picture, since I am not sure which decomposition to choose when it comes to a real problem. I have encountered a lot of related questions on this site, but they all seemed to ask about some particular detail of some particular decomposition, whereas I'm asking about all of them at once.

Could someone make it clear when one should use each decomposition and why? Some examples of use cases would be highly appreciated.


The applications of the LU, QR, and SVD decompositions are diverse: their purpose is not only to solve least-squares problems or systems of linear equations, but also to accurately compute important properties of the matrix at hand (e.g. rank, condition number), which can help you draw conclusions about whatever problem you are working on. I shall attempt to clarify the concerns/misunderstandings raised in your question.

$(1)$ : Also, time complexity for solving $Ax=b$ as $x=A^{-1}b$ is $\mathcal{O}(n^{2})$ if we already know $A^{-1}$.

Computing $x$ via the inverse $A^{-1}$ is a major crime in the field of numerical linear algebra, because recall that we do not work in exact arithmetic but in finite precision, up to machine epsilon. Matrices in many cases are ill-conditioned, and numerically computing the inverse of a matrix can yield errors of large magnitude, especially if the entries differ in magnitude by a factor of $10^{m}$ for large $m$. For instance, consider : $$ \mathbf{W}:=\begin{bmatrix}\epsilon & 1/\epsilon \\ 1+\epsilon & \epsilon \end{bmatrix} $$ In exact arithmetic, we have : $$ \mathbf{W}^{-1}=\frac{1}{\epsilon^{2}-\frac{1}{\epsilon}-1}\begin{bmatrix}\epsilon & -1/\epsilon \\ -(1+\epsilon) & \epsilon \end{bmatrix} $$ for small $\epsilon$. In finite precision this is not what you get: floating-point rounding makes $\epsilon^{2}$ and $-1$ negligible next to $-\frac{1}{\epsilon}$, and $1+\epsilon$ rounds to $1$, so the software you are working with will effectively output : $$ \mathbf{W}^{-1}\approx\frac{1}{-\frac{1}{\epsilon}}\begin{bmatrix}\epsilon & -1/\epsilon \\ -1 & \epsilon \end{bmatrix}=-\epsilon\begin{bmatrix}\epsilon & -1/\epsilon \\ -1 & \epsilon \end{bmatrix}\neq\frac{1}{\epsilon^{2}-\frac{1}{\epsilon}-1}\begin{bmatrix}\epsilon & -1/\epsilon \\ -(1+\epsilon) & \epsilon \end{bmatrix} $$ Therefore, whenever you wish to solve a system $\mathbf{Ax=b}$ with $\mathbf{A}$ invertible, do not attempt to solve it by computing the inverse unless you are forced to.
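To see this concretely, here is a small NumPy sketch (the Hilbert matrix and the size $n=12$ are my own illustrative choices) comparing a direct solve against multiplication by the explicit inverse on an ill-conditioned system:

```python
import numpy as np

# Ill-conditioned example: the 12x12 Hilbert matrix, cond(A) ~ 1e16.
n = 12
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = A @ x_true

x_solve = np.linalg.solve(A, b)   # LU with partial pivoting (backward stable)
x_inv = np.linalg.inv(A) @ b      # explicit inverse: avoid in practice

r_solve = np.linalg.norm(b - A @ x_solve)  # residual of the direct solve
r_inv = np.linalg.norm(b - A @ x_inv)      # residual via the inverse
print(f"residual (solve) = {r_solve:.2e}, residual (inverse) = {r_inv:.2e}")
```

The direct solve leaves a residual near machine precision, while the inverse-based route loses many digits on this matrix.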

$(2)$ : But if you already have your matrices $L$ and $U$, time complexity for solving $LUx=b$ is the same. So, as I know, in this case using LU decomposition is only considered better for numerical reasons.

Two main points to emphasize here. First, you are correct that the complexity of solving $\mathbf{LUx=b}$ is $\mathcal{O}(n^{2})$ provided that both $\mathbf{L}$ and $\mathbf{U}$ are known; otherwise, computing the factorization costs $\mathcal{O}(\frac{2}{3}n^{3})$. The second point is that it's not always true that the LU decomposition guarantees numerical stability. The main criterion for the plain LU decomposition to exist is that the matrix has the principal minor property (all leading principal minors non-singular), and this is not always the case. Instead, to guarantee good numerical stability with an LU-based solve, one should use Gaussian elimination with (scaled) partial pivoting, where for a matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ one computes a permutation matrix $\mathbf{P}$ (a row permutation of the identity) such that $\mathbf{A}=\mathbf{PLU}$. The cost of this operation is the same as that of the ordinary LU decomposition.
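This also shows the factor-once, solve-many pattern you mentioned: a sketch using SciPy's pivoted LU routines (the matrix here is a random example of my own):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# Factor once with partial pivoting: O(n^3), stored as (lu, piv).
lu, piv = lu_factor(A)

# Reuse the factors for several right-hand sides: O(n^2) each.
for _ in range(3):
    b = rng.standard_normal(5)
    x = lu_solve((lu, piv), b)   # two triangular solves
    assert np.allclose(A @ x, b)
```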

$(3)$ : QR decomposition can make calculation of least squares solution simpler, since the equation becomes $\hat{x}=R^{-1} Q^{\mathrm{T}} b$. But taking into account the fact you first need to calculate $\mathrm{QR}$ decomposition itself and only after that apply least squares, then this $\mathrm{QR}$ decomposition doesn't speed up anything.

First of all, the QR decomposition is indeed the standard way of solving least-squares problems effectively and simply, which is why MATLAB's backslash operator uses a QR factorization for rectangular systems (MATLAB's $\mathsf{lsqr(.)}$, by contrast, is an iterative Krylov solver, not a QR-based one). Furthermore, in the numerical setting the QR decomposition is computed using Householder transformations instead of the Gram-Schmidt process, which is good because it avoids the loss of orthogonality that frequently occurs with Gram-Schmidt. Householder QR is a somewhat involved procedure with a cost of about $\mathcal{O}(2mn^{2}-\frac{2}{3}n^{3})$ for an $m \times n$ matrix, which can be time-consuming but ensures stability.
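A minimal sketch of QR-based least squares in NumPy (the sizes and random data are illustrative), computing $\hat{x}=R^{-1}Q^{\mathrm{T}}b$ as above:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))   # overdetermined system
b = rng.standard_normal(20)

# Thin QR via Householder reflections: Q is 20x3, R is 3x3 upper triangular.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)   # solve the small triangular system

# Cross-check against NumPy's SVD-based least-squares routine.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_qr, x_ref)
```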

I also know it's a good idea to use LU or QR, when you always have the same matrix A and different b's. But not considering this use case, are all these decompositions used only for numerical stability? Because I don't actually see how they can speed up calculations.

One thing you left out of the least-squares discussion is the SVD, which most consider far more numerically stable than both QR and LU; however, it is also the most expensive. Since you brought up speed as a factor: there is a decomposition called the Cholesky decomposition whose advantage over the others is precisely speed. It results from the LU decomposition if we assume the matrix $\mathbf{A}$ is symmetric positive definite (SPD), which implies that $\mathbf{A}$ has the principal minor property, and thus we may write : \begin{align*}\mathbf{A}&=\mathbf{LU}\\ &=\mathbf{LDL^{\top}}\\ &=\mathbf{LD^{1/2}D^{1/2}L^{\top}}\\ &=\mathbf{BB^{\top}}\end{align*} Computing the Cholesky decomposition requires $\mathcal{O}(\frac{1}{3}n^{3})$, which is half the cost of the LU decomposition. The way to solve least squares with it is by substitution in the normal equations : $$ \mathbf{A^{\top}Ax=A^{\top}b} $$ $$ \mathbf{BB^{\top}x=A^{\top}b} $$ Therefore, you first compute $\mathbf{d=A^{\top}b}$, which is $\mathcal{O}(mn)$, then compute $\mathbf{A^{\top}A}$, which is $\mathcal{O}(mn^{2})$ (but can also be reduced), then perform the Cholesky decomposition $\mathbf{A^{\top}A=BB^{\top}}$, then solve $\mathbf{By=d}$ by forward substitution, and finally solve $\mathbf{B^{\top}x=y}$ by backward substitution. Note that $\mathbf{B}$ is a lower triangular matrix.
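The steps above can be sketched as follows with SciPy (the matrix sizes and data are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 4))
b = rng.standard_normal(30)

M = A.T @ A                    # O(mn^2); SPD when A has full column rank
d = A.T @ b                    # O(mn)
B = cholesky(M, lower=True)    # O(n^3 / 3): M = B B^T, B lower triangular

y = solve_triangular(B, d, lower=True)       # forward substitution: B y = d
x = solve_triangular(B.T, y, lower=False)    # backward substitution: B^T x = y

# Cross-check against NumPy's SVD-based least-squares routine.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```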

The issue with the Cholesky approach is that it may be numerically unstable, because the condition number of $\mathbf{A^{\top}A}$ is the square of the condition number of $\mathbf{A}$; this is where you have to trade speed against accuracy. Always remember that many matrices in applications are ill-conditioned, especially the Vandermonde matrices used in regression, and squaring the condition number of an ill-conditioned matrix makes the decomposition even less stable. Also recall that $\mathbf{A^{\top}A}$ must be SPD (equivalently, $\mathbf{A}$ must have full column rank) for the Cholesky decomposition to exist.
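The condition-number squaring is easy to verify numerically (random data of my own choosing; $\kappa_2(\mathbf{A^{\top}A})=\kappa_2(\mathbf{A})^2$ since the singular values get squared):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 5))

kappa = np.linalg.cond(A)            # 2-norm condition number sigma_max/sigma_min
kappa_normal = np.linalg.cond(A.T @ A)
print(f"cond(A) = {kappa:.3e}, cond(A^T A) = {kappa_normal:.3e}")
```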

$(4)$ : But it seemed to me those are just little pieces of a big picture, since I am not sure what decomposition to choose when it comes to a real problem. I encountered a lot of related questions on this site, but they all seemed to ask about some particular detail regarding some particular decomposition whereas I'm asking about all them at once.

"They" are right to ask you to provide more specific detail since "they" don't have enough information on :

  1. The type of matrix you have: is it ill-conditioned? Is it symmetric positive definite? Does it have the principal minor property? Is it sparse (i.e., does it have only a few percent non-zero entries)?

  2. The type of problem you are working on: is it a regression problem with some constraint on it? Is it a homogeneous least-squares problem?

For 1., I would suggest you first take a look at what MATLAB does when it solves a system of equations $\mathbf{Ax=b}$ using the simple backslash operator $\mathsf{"\backslash"}$. Below is the routine it follows :

[Decision flowchart taken from MATLAB's documentation on the backslash operator]

Source : https://www.mathworks.com/help/matlab/ref/mldivide.html

However, it does not end there. What if the matrix is sparse? In that case the backslash operator performs poorly, since the matrix has only $\mathcal{O}(n)$ non-zero entries while every dense decomposition costs around $\mathcal{O}(n^{3})$, which is overkill in many cases and may introduce numerical error. Note that sparse matrices are more common than dense ones in practice, especially in regression or modelling problems where the data itself is sparse (google "sparse regression").

Methods for handling sparse matrices were studied heavily starting in the 1950s, e.g. the Jacobi method, the Gauss–Seidel method, and Successive Over-Relaxation (SOR). These are iterative methods that approximate $\mathbf{x}$ by computing iterates $\mathbf{x}_{r}$ approaching the solution of $\mathbf{Ax=b}$ until $\|\mathbf{x}-\mathbf{x}_{r}\|<\operatorname{tol}$.
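As a sketch of how such iterations look, here is a minimal Jacobi solver (my own example matrix; the method converges for strictly diagonally dominant $\mathbf{A}$). It splits $\mathbf{A}=\mathbf{D}+\mathbf{R}$ with $\mathbf{D}$ diagonal and iterates $\mathbf{x}_{k+1}=\mathbf{D}^{-1}(\mathbf{b}-\mathbf{R}\mathbf{x}_{k})$:

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Jacobi iteration: split A = D + R and iterate x <- D^{-1}(b - R x)."""
    D = np.diag(A)            # diagonal of A as a vector
    R = A - np.diag(D)        # off-diagonal part
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant test matrix, so the iteration converges.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 5.0, 2.0],
              [0.0, 2.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x = jacobi(A, b)
assert np.allclose(A @ x, b, atol=1e-8)
```

In real sparse codes the matrix-vector product `R @ x` is the only expensive step, costing $\mathcal{O}(\text{nnz})$ rather than $\mathcal{O}(n^{2})$.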

Eventually, with the advancement of numerical computing in the 1970s and 1980s, a modern and powerful approach for dealing with sparse matrices emerged: the Krylov subspace methods, which are still an active field of research. Krylov methods are also iterative and can handle very large sparse matrices.
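For instance, the conjugate gradient method (a Krylov method for SPD systems) never factors the matrix at all; a sketch with SciPy on a sparse 1-D Laplacian of my own choosing:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# Sparse SPD system: tridiagonal 1-D Laplacian, stored with O(n) entries.
n = 1000
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Conjugate gradients: each iteration only needs a sparse mat-vec product.
x, info = cg(A, b)
print("converged" if info == 0 else f"info = {info}")
```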

One last note: in Python, the NumPy library's built-in least-squares function $\mathsf{numpy.linalg.lstsq}$ uses the SVD to solve least-squares problems, unlike MATLAB's backslash, which uses QR for rectangular systems.