Exercise: Determine the amount of memory and flops for iteration $i$ of the following (GCR) algorithm:
$$\begin{split}\text{for }&i = 1,2,\ldots\text{ do}\\&s^i = r^{i-1},\\&v^i = As^i,\\&\text{for }j =1,\ldots,i-1\,\text{do}\\&\,\,\,\alpha = (v^j)^Tv^i,\\&\,\,\,s^i:=s^i - \alpha s^j,\,v^i:=v^i - \alpha v^j\\&\text{end for}\\&s^i:=s^i\big/\|v^i\|_2,\,v^i:=v^i\big/\|v^i\|_2\\&\beta = (v^i, r^{i-1});\\&u^i:=u^{i-1}+\beta s^i;\\&r^i = r^{i-1} - \beta v^i \end{split}$$
The given solution: In the algorithm the following vectors are stored: $b, u, r, s_k,$ and $v_k$. This implies that $(2i + 3)n$ memory positions are needed. Per iteration one needs to do one matrix vector multiplication, $i+1$ inner products, $2i$ vector updates, and $2$ scalings. In total one matrix vector multiplication and $(6i + 4)n$ flops for the remaining parts.
Question: Why does one iteration of the GCR algorithm totals one matrix multiplication plus $(6i + 4)n$? (I would also like to know why storing the vectors $b,u,r,s_k$ and $v_k$ requires $(2i+3)n$ memory positions, but perhaps this isn't the right place to ask about that.)
What I've tried myself: During class I was told that a flop is the cast of a floating point operation as there are $+, -, *, /$ and $\sqrt{}$. I understand that there are $i+1$ inner products; there are $i -1$ inner products calculated in the second loop and $2$ more once the second loop is done. The $2i$ vector updates make sense as well; in the second loop there are $2i - 2$ vector updates, and when the second loop has ended there are two more (if you don't count the scaling as a vector update as well). The scalings seem obvious, there are two. If I add this all together I get one matrix vector multiplication and $3i + 3$ remaining flops; what am I doing wrong?
Thanks in advance!
An inner product $$(a,b)=\sum_{k=1}^na_k*b_k$$ takes $2n-1$ FLOPs: $n$ to multiply components and $n-1$ to add up the products.
A vector update $a=a+cb$ takes $2$ FLOPs for each component: $a_k=a_k+c*b_k$ or $2n$ FLOPs to update all components.
A scaling takes $n$ flops, $1$ to scale each component.
To compute $||v^i||_2$ required not only an inner product but also a square root, so I get $(i+1)(2n-1)+(2i)(2n)+(2)(n)+1=(6i+4)n-i$ FLOPs. So I disagree with the given answer a little because of that $-i$ in my result, but hopefully you can see why there are so many more FLOPs than you counted because you didn't take into account that vector operations require multiple FLOPs.