Computing $(A^TA)A^T$ versus $A^T(AA^T)$ on a Computer with Limited Memory where $A$ is $1000000 \times 2$


The following is an exercise taken from Aggarwal's Linear Algebra and Optimization for Machine Learning:

Let $A$ be a $1000000 \times 2$ matrix. Suppose you have to compute $A^TAA^T$ on a computer with limited memory. Is it preferable to compute $(A^TA)A^T$ or $A^T(AA^T)$?

This question is stated at the very beginning of the text, so I only have the definition of matrix multiplication to work with.

I think it would be preferable to compute $A^TA$ first, because it can be computed as the sum of one million $2\times 2$ outer-product matrices, i.e. $A^TA = \sum_{j=1}^{10^6} (\mathrm{Row}_j(A))^T\mathrm{Row}_j(A)$. So to compute $A^TA$, we only need to store two $2\times 2$ matrices at a time (the running sum and the current outer product), just 8 numbers!
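To illustrate the accumulation idea, here is a minimal NumPy sketch (the function name `ata_streaming`, the generator interface, and the small test matrix are my own illustration, not from the exercise). It consumes one row at a time, so only the running $2\times 2$ sum and the current row are ever in memory:

```python
import numpy as np

def ata_streaming(rows):
    """Accumulate A^T A as a sum of 2x2 outer products, one row at a time.

    Only the running 2x2 sum and the current length-2 row are ever held
    in memory, regardless of how many rows A has.
    """
    result = np.zeros((2, 2))
    for row in rows:                      # each row is a length-2 vector
        result += np.outer(row, row)
    return result

# Usage: a small stand-in for the 10^6 x 2 matrix; rows come from an
# iterator, so in principle they could be streamed from disk one by one.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 2))
assert np.allclose(ata_streaming(iter(A)), A.T @ A)
```

Because the rows arrive through an iterator, $A$ itself never needs to fit in memory at once.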

By contrast, computing $AA^T$ first seems like it would require far more memory.

In the context of someone who only knows about the definition of matrix multiplication, would this be correct? What if I knew heavier machinery, like say matrix factorizations, or other tools? Would there be a better way to approach this problem?

Accepted answer:

In the context of someone who only knows about the definition of matrix multiplication, would this be correct?

$A^TA$ is a $2 \times 2$ matrix (4 entries), while $AA^T$ is a $1000000 \times 1000000$ matrix ($10^{12}$ entries, roughly 8 TB in double precision). I think the choice of which one to store as an intermediate result is obvious.
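To make the sizes concrete, a quick back-of-the-envelope calculation (assuming 8-byte double-precision entries; the variable names are mine):

```python
# Memory footprint of each possible intermediate, at 8 bytes per double.
m, n = 1_000_000, 2
ata_bytes = n * n * 8   # A^T A is n x n
aat_bytes = m * m * 8   # A A^T is m x m
print(ata_bytes)        # prints: 32
print(aat_bytes)        # prints: 8000000000000  (i.e. 8 TB)
```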

What if I knew heavier machinery, like say matrix factorizations, or other tools? Would there be a better way to approach this problem?

Sure, there are optimizations you could use. If you happen to know that $A$ is sparse, you could use a storage format that doesn't require explicitly storing all of its elements. You could compute $AA^T$'s trillion elements on demand during the second multiplication, instead of explicitly storing all of them. At a minimum, you could exploit the fact that $AA^T$ is always symmetric to cut your storage space in half.
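The on-demand idea can be sketched as follows (a sketch under my own assumptions; the function name `at_aat_on_demand` and the column-by-column strategy are illustrative, not from the answer). Entry $(k, j)$ of $AA^T$ is just the dot product of rows $k$ and $j$ of $A$, so each column of $AA^T$ can be regenerated whenever the outer multiplication needs it:

```python
import numpy as np

def at_aat_on_demand(A):
    """Compute A^T (A A^T) without ever materializing the m x m matrix A A^T.

    Column j of A A^T equals A @ A[j], so it is regenerated on the fly for
    each column of the final n x m result.
    """
    m, n = A.shape
    out = np.zeros((n, m))
    for j in range(m):
        col = A @ A[j]          # column j of A A^T, length m, built on demand
        out[:, j] = A.T @ col   # column j of the final n x m result
    return out

# Check against the direct (memory-hungry) computation on a small matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 2))
assert np.allclose(at_aat_on_demand(A), A.T @ (A @ A.T))
```

Note that this trades memory for time: the storage stays small, but the arithmetic is still on the order of $m^2$ operations.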

But regardless of how fancy you make your code, I just don't see any practical way to beat a $2 \times 2$ intermediate result, memory-wise.

Second answer:

The product of an $m\times n$ and an $n \times k$ matrix is an $m \times k$ matrix, and each entry requires $n$ multiplications to compute, for a total of $mnk$ multiplications (assuming the naive algorithm).

If $A$ is $m\times n$, then $A^T$ is $n\times m$, and so to compute $(A^TA)A^T$ requires $n^2m$ multiplications for the first matrix multiplication, after which we are multiplying together an $n\times n$ matrix and an $n\times m$ matrix, which will take $n^2m$ multiplications, for a total of $2n^2m$ multiplications.

On the other hand, to compute $A^T(AA^T)$ requires $m^2n$ multiplications for the first product, after which we are multiplying an $n\times m$ matrix by an $m\times m$ matrix, which takes another $m^2n$ multiplications, for a total of $2nm^2$ multiplications. If $m$ and $n$ are very different from each other, then $2n^2m$ and $2nm^2$ will be vastly different numbers: here $m = 10^6$ and $n = 2$, so the two orderings cost $8\times 10^6$ versus $4\times 10^{12}$ multiplications.
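The counts above can be tallied directly from the $mnk$ rule (a small sketch; the helper `matmul_cost` is my own name for it):

```python
def matmul_cost(m, n, k):
    """Multiplications needed to multiply an m x n by an n x k matrix
    with the naive algorithm: one length-n dot product per output entry."""
    return m * n * k

m, n = 1_000_000, 2

# (A^T A) A^T: first (n x m)(m x n), then (n x n)(n x m)
left = matmul_cost(n, m, n) + matmul_cost(n, n, m)

# A^T (A A^T): first (m x n)(n x m), then (n x m)(m x m)
right = matmul_cost(m, n, m) + matmul_cost(n, m, m)

print(left, right)   # prints: 8000000 4000000000000
```

So the left-to-right parenthesization is cheaper by a factor of half a million, matching the $2n^2m$ versus $2nm^2$ analysis.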