I have attempted to implement LU factorization from pages 3-6 of this document:
http://www4.ncsu.edu/~kksivara/ma505/handouts/lu-pivot.pdf
An example of running the code below for a simple 4x4 matrix:
A=[2 1 1 0; 4 3 3 1; 8 7 9 5 ; 6 7 9 8];
intch=[];flag=0;
>> [Bp]=Gauss2(A,4,intch,flag);
>> B=lu(A);
>> Bp
Bp =
8.0000 7.0000 9.0000 5.0000
0.7500 1.7500 2.2500 4.2500
0.5000 -0.2857 -0.8571 -0.2857
0.2500 -0.4286 0.3333 0.6667
>> B
B =
8.0000 7.0000 9.0000 5.0000
0.7500 1.7500 2.2500 4.2500
0.5000 -0.2857 -0.8571 -0.2857
0.2500 -0.4286 0.3333 0.6667
As you can see, the matrices Bp and B are the same, which suggests that my LU factorization algorithm works correctly for a 4x4 matrix A. I have also tried A as a 60x60 and a 256x256 matrix, both of which work correctly (i.e. Bp==B). However, I am now running this code on a dataset that is 4000x4000, and my program does not finish (even after running overnight for 10 hours). While the program is running in MATLAB, I can click the "Pause" button; this drops the program into debug mode and shows the precise line of code where the time is being spent. It turns out that, in the "Gauss2" function below, the program spends most of its time in the two nested for loops at the end of the function (see where I have marked "BOTTLENECK" in the comments).
How do I improve the speed of the matrix multiplication here? If you go to the very bottom of pg. 5 of http://www4.ncsu.edu/~kksivara/ma505/handouts/lu-pivot.pdf you will see Equations 21.6 and 21.7, which say:
L_k' = P(m-1)*...*P(k+1) * L_k * P(k+1)*...*P(m-1)
and
L = inverse(L(m-1)' * ... * L(2)' * L(1)')
This is exactly what the two nested for loops below are trying to do. Is there a smarter way to compute L_k' ?
function [Bp] = Gauss2(A, n, intch, flag)
    intch = zeros(n,1);
    flag = 0;
    a = zeros(n,n);
    for k = 1:n-1
        % find the pivot: largest entry in column k on or below the diagonal
        [amax, amax_index] = max(abs(A(k:n,k)));
        amax_index = amax_index + k - 1;
        if (amax == 0)
            flag = 1;       % matrix is singular
            intch(k) = 0;
        else
            intch(k) = amax_index;
            if (amax_index ~= k)
                % pivoting: interchange rows k and amax_index
                A([k,amax_index],:) = A([amax_index,k],:);
            end
            % calculate multipliers and store them in "a"
            for i = k+1:n
                a(i,k) = -A(i,k)/A(k,k);
            end
            % perform elimination for column "k"
            for i = k+1:n
                A(i,:) = A(i,:) + a(i,k)*A(k,:);
            end
        end
    end
    if (A(n,n) == 0)
        flag = 1;
        intch(n) = 0;
    else
        intch(n) = n;
    end
    U = A;   % this is the upper triangular matrix
    % next, need to determine "L_k'" from the algorithm, for all k
    prevX = eye(n);
    for k = 1:n-1
        X = eye(n);
        for j = (n-1):-1:(k+1)
            X = X*P(j,intch(j),n);   % <---- BOTTLENECK HERE
        end
        X = X*L(a,k,n);
        for j = (k+1):(n-1)
            X = X*P(j,intch(j),n);   % <---- BOTTLENECK HERE
        end
        prevX = X*prevX;
    end
    Lower = inv(prevX);
    Bp = U + Lower - eye(n);
end
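As a side note (separate from the main bottleneck): the two per-column loops inside Gauss2 that compute the multipliers and perform the elimination can be written as single array operations. A self-contained sketch for one column, using the same arithmetic as the loops:

```matlab
% Vectorized version of the multiplier/elimination loops for column k:
% same arithmetic as the for-loops in Gauss2, done as array operations.
n = 4; k = 1;
A = [8 7 9 5; 4 3 3 1; 2 1 1 0; 6 7 9 8];
a = zeros(n,n);
a(k+1:n,k) = -A(k+1:n,k)/A(k,k);              % all multipliers for column k at once
A(k+1:n,:) = A(k+1:n,:) + a(k+1:n,k)*A(k,:);  % rank-1 (outer product) update
% after the update, A(k+1:n,k) is exactly zero
```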
Below is my code for P.m and L.m. These functions generate the permutation matrices P_1, P_2, ..., P_k and the elementary matrices L_1, L_2, ..., L_k, which I need in order to compute the L_k' matrices and, from them, the final lower triangular matrix L.
function [Y] = L(a,k,n)
    Y = eye(n);
    Y((k+1):n,k) = a((k+1):n,k);
end
function [X] = P(i,j,n)
    X = eye(n);
    X([i,j],:) = X([j,i],:);
end
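One property worth noting about the bottleneck lines (this is plain linear algebra, not specific to the handout): each P(i,j,n) above is a transposition, so right-multiplying any matrix by it merely swaps two columns. A self-contained demonstration:

```matlab
% Right-multiplying X by the transposition P(i,j,n) (identity with rows
% i and j swapped) is the same as swapping columns i and j of X.
n = 4; i = 2; j = 3;
P = eye(n); P([i,j],:) = P([j,i],:);   % the question's P(i,j,n)
X = magic(n);
Y1 = X*P;                              % O(n^3) dense matrix product
Y2 = X; Y2(:,[i,j]) = X(:,[j,i]);      % O(n) column swap
% isequal(Y1, Y2) is true, so each bottleneck line
%   X = X*P(j,intch(j),n);
% can be replaced by
%   X(:,[j,intch(j)]) = X(:,[intch(j),j]);
```

Even with that change, the remaining X*L(a,k,n) and prevX products are each O(n^3), so assembling L this way is still O(n^4) overall. The standard cure is to apply each row interchange to the already-stored multiplier columns of "a" during elimination itself, so that L can be read off directly with none of these products.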
There are a number of ways to improve the efficiency. One that comes to mind is block LU decomposition. Instead of the original
$$ A= LU \tag{1}$$
we do $$ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22}\end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22}\end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22}\end{bmatrix} \tag{2}$$
there is code here.
Another aspect is sparsity: if the matrix is sparse, a sparse LU solver can exploit that structure. There are also randomized algorithms that multiply by random matrices to approximate the decomposition. And depending on the structure: if the matrix is symmetric positive definite, you can use a Cholesky decomposition instead, which is roughly twice as fast.
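For instance (a self-contained sketch; chol is built into MATLAB and applies only to symmetric positive definite matrices):

```matlab
% Cholesky factorization A = R'*R with R upper triangular:
% roughly half the flops of LU, and no pivoting is needed for an SPD matrix.
n = 500;
B = randn(n);
A = B'*B + n*eye(n);   % manufacture a symmetric positive definite test matrix
R = chol(A);           % errors out if A is not symmetric positive definite
relerr = norm(A - R'*R, 'fro') / norm(A, 'fro');   % machine-precision level
```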
Note that with the block LU we are solving several smaller problems:
$$ A_{11} = L_{11} U_{11} \tag{3} $$ $$ A_{12} = L_{11}U_{12} \tag{4} $$ $$ A_{21} = L_{21}U_{11} \tag{5} $$ $$ A_{22} = L_{21}U_{12} + L_{22}U_{22} \tag{6} $$
Splitting in half, your $4000 \times 4000$ matrix becomes four $2000 \times 2000$ blocks.
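A minimal sketch of one level of this block scheme (my own illustration, not library code; there is no outer pivoting here, and MATLAB's two-output lu absorbs its internal row pivoting into the returned L factors, so Lb is only "psychologically" lower triangular):

```matlab
function [Lb, Ub] = blockLU(A, m)
% One level of block LU, equations (3)-(6): split A at row/column m.
n = size(A,1);
A11 = A(1:m,1:m);     A12 = A(1:m,m+1:n);
A21 = A(m+1:n,1:m);   A22 = A(m+1:n,m+1:n);
[L11, U11] = lu(A11);               % (3)  A11 = L11*U11
U12 = L11 \ A12;                    % (4)  A12 = L11*U12
L21 = A21 / U11;                    % (5)  A21 = L21*U11
[L22, U22] = lu(A22 - L21*U12);     % (6)  factor the Schur complement
Lb = [L11, zeros(m, n-m); L21, L22];
Ub = [U11, U12; zeros(n-m, m), U22];
end
```

The speedup comes from the Schur-complement update `A22 - L21*U12` and the triangular solves running as single BLAS level-3 calls instead of element-by-element loops; applied recursively, this is essentially how the blocked LU in LAPACK (which MATLAB's built-in lu uses) is organized.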