Let $F(x)$ be a convex function to be minimized, with optimal value $F(x^*)$ attained at $x^*\in\mathbb{R}^n$.
The difference $|F(x)-F(x^*)|$ is called the excess error.
Using Stochastic Gradient Descent (with a certain minibatch size) and a large training set, Goodfellow ("Deep Learning", p. 288) states that the excess error for a strongly convex problem is $O(\frac{1}{k})$ after the $k$-th iteration. For batch GD (i.e., using all of the data at each iteration) this convergence rate is said to be better than for the minibatch method.
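To make the comparison concrete, here is a small toy experiment of my own (not from Goodfellow or Bottou): a strongly convex least-squares objective $F(x)=\frac{1}{2n}\|Ax-b\|^2$, tracking the excess error $F(x_k)-F(x^*)$ for full-batch GD with a constant step size and for minibatch SGD with a $1/k$ step-size decay. All names and parameter choices here are my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Exact minimizer of the least-squares objective.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def F(x):
    """F(x) = (1/2n) ||Ax - b||^2, strongly convex when A has full column rank."""
    r = A @ x - b
    return 0.5 * np.mean(r ** 2)

def run(batch_size, steps, lr0):
    x = np.zeros(d)
    errs = []
    for k in range(1, steps + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        # Constant step for full-batch GD, O(1/k) decay for minibatch SGD.
        lr = lr0 if batch_size == n else lr0 / k
        x = x - lr * g
        errs.append(F(x) - F(x_star))
    return errs

gd_errs = run(n, 200, 0.1)    # batch GD: linear convergence
sgd_errs = run(10, 200, 0.1)  # minibatch SGD: sublinear, roughly O(1/k)
print(f"GD excess error after 200 steps:  {gd_errs[-1]:.2e}")
print(f"SGD excess error after 200 steps: {sgd_errs[-1]:.2e}")
```

In runs of this sketch, batch GD drives the excess error down much faster than minibatch SGD, which is consistent with Goodfellow's statement on this toy problem.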
My professor says this is not true (there must be some other factor within the order of the minibatch GD) and pointed me to the paper by Bottou and Bousquet, "The Tradeoffs of Large Scale Learning". (Note that SGD and GD in this paper correspond to using only one data point at each iteration and using a minibatch at each iteration, respectively!)
However, this paper uses slightly different notation. Goodfellow's excess error should correspond to the tolerance/accuracy $\rho$. Yet the paper states a different result, namely that $O(\log(1/\rho))$ iterations are needed to reach accuracy $\rho$.
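If I convert both statements into iteration counts (my own translation, so I may be misreading one of them), they seem to describe different convergence regimes:
$$
F(x_k) - F(x^*) \le \frac{C}{k} \;\Longrightarrow\; k = O\!\left(\frac{1}{\rho}\right) \text{ iterations for accuracy } \rho,
$$
whereas
$$
F(x_k) - F(x^*) \le C\,\gamma^k,\quad \gamma<1 \;\Longrightarrow\; k = O\!\left(\log\frac{1}{\rho}\right) \text{ iterations,}
$$
i.e., $O(\log(1/\rho))$ iterations correspond to linear (geometric) convergence, not to an $O(1/k)$ excess error.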
My questions:
- Is it true what Goodfellow writes?
- Is there any paper which measures the excess errors of minibatch GD and batch GD as Goodfellow does? (His sources are papers by Bottou, which all use this accuracy $\rho$ and state something different.)
- Can someone give a short overview of the convergence rate of excess errors in the SGD and GD case?
EDIT: Forgot the logarithm in the term for the order of accuracy.