Why do we use average iterates when implementing SGD?


I have read many papers about stochastic gradient descent (SGD). One thing I am curious about is that many non-asymptotic convergence results are stated in terms of the error between the optimal solution and the average of the iterates generated by SGD, for example $\mathbb{E}\left[f(\bar x^k)-f(x^*)\right]$ or $\mathbb{E}\left\| \bar x^k-x^* \right\|^2$. My question is: why do we want to use the average of the iterates? Why don't we just analyze the final iterate of SGD? Thank you!
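For concreteness, here is a minimal sketch of the kind of averaging I mean (Polyak–Ruppert averaging), on a toy one-dimensional problem with a noisy gradient. The objective, step-size schedule, and noise level are all assumptions chosen just for illustration:

```python
import random

random.seed(0)

# Toy problem: minimize f(x) = (x - x_star)^2, observing only a
# noisy gradient 2*(x - x_star) + noise at each step.
x_star = 1.0

x = 0.0       # current iterate x^k
x_bar = 0.0   # running average of iterates, \bar x^k
n_steps = 20000

for k in range(1, n_steps + 1):
    noise = random.gauss(0, 1)
    grad = 2 * (x - x_star) + noise   # stochastic gradient
    x -= grad / (2 * k ** 0.5)        # step size ~ 1/sqrt(k)
    x_bar += (x - x_bar) / k          # incremental average: mean of x^1..x^k

print("last iterate error:", abs(x - x_star))
print("averaged iterate error:", abs(x_bar - x_star))
```

Running this, the last iterate keeps fluctuating at roughly the scale of (step size × noise), while the average smooths those fluctuations out, which is one intuition for why the non-asymptotic bounds are stated for $\bar x^k$.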