The following is an exercise taken from Aggarwal's Linear Algebra and Optimization for Machine Learning:
Let $A$ be a $1000000 \times 2$ matrix. Suppose you have to compute $A^TAA^T$ on a computer with limited memory. Is it preferable to compute $(A^TA)A^T$ or $A^T(AA^T)$?
This question is stated at the very beginning of the text, so I only have the definition of matrix multiplication to work with.
I am thinking that it would be preferable to first compute $A^TA$. The reason is that this can be computed as the sum of one million $2\times 2$ outer product matrices, i.e. $A^TA = \sum_{j=1}^{10^6} (\mathrm{Row}_j(A))^T\mathrm{Row}_j(A)$. So to compute $A^TA$, we'd only need to store two $2\times 2$ matrices at a time (the running sum and the current outer product), just 8 numbers!
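This row-at-a-time accumulation can be sketched in NumPy (a toy $n$ stands in for $10^6$, but the memory pattern is identical: only one row plus the $2\times 2$ accumulator is touched at a time):

```python
import numpy as np

# Toy size: n = 1000 instead of 10^6, purely so the demo runs quickly.
rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, 2))

# Accumulate A^T A as a running sum of outer products
# (Row_j(A))^T Row_j(A), each a 2x2 matrix.
AtA = np.zeros((2, 2))
for row in A:          # in a real low-memory setting, rows would stream from disk
    AtA += np.outer(row, row)

# Sanity check against the direct product.
assert np.allclose(AtA, A.T @ A)
```

Once the $2\times 2$ matrix $A^TA$ is in hand, the final product $(A^TA)A^T$ is $2 \times 10^6$ and can likewise be produced one column at a time, since column $j$ is $(A^TA)(\mathrm{Row}_j(A))^T$.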
By contrast, first computing $AA^T$ seems like it would require much more memory.
For someone who only knows the definition of matrix multiplication, would this be correct? What if I knew heavier machinery, say matrix factorizations or other tools? Would there be a better way to approach this problem?
$A^TA$ is a $2 \times 2$ matrix (4 total cells), while $AA^T$ is a $1000000 \times 1000000$ matrix ($10^{12}$ total cells). I think that the choice of which one to store as an intermediate result is obvious.
Sure, there are optimizations you could use. Like, if you happen to know that $A$ is a sparse matrix, you could use an alternate storage format that doesn't require explicitly storing all of its elements. You could calculate $AA^T$'s trillion elements on-demand during the second multiplication, instead of explicitly storing all of them. At the very minimum, you could take advantage of the fact that $AA^T$ is always symmetric to cut your storage space in half.
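The on-demand idea can be sketched like this (again with a toy $n$, and the variable names are mine): since column $j$ of $AA^T$ is $A\,(\mathrm{Row}_j(A))^T$, you can form one length-$n$ column at a time and immediately fold it into the second multiplication, never storing the $n \times n$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # stands in for 10^6
A = rng.standard_normal((n, 2))

# Compute A^T (A A^T) without materializing the n x n matrix A A^T:
# column j of A A^T is A @ A[j], so column j of the result is A.T @ (A @ A[j]).
result = np.empty((2, n))
for j in range(n):
    col = A @ A[j]           # j-th column of A A^T: recomputed, never stored
    result[:, j] = A.T @ col

# By associativity this agrees with the other parenthesization.
assert np.allclose(result, (A.T @ A) @ A.T)
```

Even so, this keeps memory down at the cost of arithmetic: each column costs about $4n$ multiplications, so the whole loop is on the order of $4n^2 \approx 4 \times 10^{12}$ operations, versus roughly $8n \approx 8 \times 10^6$ for the $(A^TA)A^T$ order. The small intermediate wins on time as well as memory.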
But regardless of how fancy you make your code, I just don't see any practical way to beat a $2 \times 2$ intermediate result, memory-wise.