To calculate $y = ABx$ where $A$ and $B$ are two $N\times N$ matrices and $x$ is an vector in $N$-dimensional space. there are two methods. One is to first calculate $z=Bx$, and then $y = Az$. This way takes roughly $2N^2$ multiplications. The other is to first calculate $C = AB$, and then $y=Cx$. This way, however, takes $N^3+N^2$ multiplications!
Am I correct on this? I do not understand why change of the order of the calculation introduces so much difference in the computation complexity.
I like this question. I've seen the solution to efficiently calculating a chain of matrix multiplications in an algorithms class, but it's not very enlightening. I've always meant to think more about the problem because associativity is interesting. So, let me try my hand at an explanation.
When you have an $N$ by $N$ matrix, it's a shorthand for describing a linear transformation from an $N$-dimensional vector space to itself. It does that by keeping track of where $N$ basis vectors get sent (these are the columns of your matrix).
Now, I'll just speak heuristically and it won't be very precise. I'll say that to write down a matrix, you just need to keep track of where $N$ linearly independent vectors get sent, not necessarily a basis. This is true, but all sorts of actual computational things depend on the exact basis (like actually writing down a vector, or doing standard calculations like $Bx$).
When you compute $AB$, you're keeping track of where $N$ linearly independent vectors get sent. But when you go on and use your freshly-computed $AB$ to compute $(AB)x$, you just care about where $1$ vector gets sent -- even though you went through all the trouble of keeping track of what $AB$ does to $N - 1$ other vectors!
Computing $A(Bx)$ is great, because at each step, it only pays attention to what's happening to the vectors you care about; it has no idea what happens to the other $N - 1$ linearly independent vectors, and so it only takes about one $N$th as much work as computing $(AB)x$.
While not exactly precise, I'd like to believe this is the idea behind why computing $A(Bx)$ is more efficient than $(AB)x$, which is what I think you are after.