Recall stochastic gradient descent in the case of regression for $n$ training points:
$\text{randomly select } t \in \{1, \dots, n\}, \quad \theta^{(k+1)} = \theta^{(k)} + \eta_k \left( y^{(t)} - \theta^{(k)} \cdot x^{(t)} \right) x^{(t)}$
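In code, the update reads roughly like this (a minimal sketch with made-up, noise-free data purely for illustration; the unit-norm rows are my own choice to keep the early large steps from overshooting):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free regression data: y = theta_true . x
theta_true = np.array([2.0, -3.0])
n, d = 50, 2
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows: each step contracts the error
y = X @ theta_true

theta = np.zeros(d)
for k in range(10_000):
    t = rng.integers(n)       # randomly select t in {1, ..., n}
    eta = 1.0 / (k + 1)       # decaying learning rate eta_k
    theta += eta * (y[t] - theta @ X[t]) * X[t]

print(np.linalg.norm(theta - theta_true))  # error shrinks toward 0
```

With unit-norm rows each update scales the error component along $x^{(t)}$ by $(1 - \eta_k) \in [0, 1)$, so the iterates drift toward $\theta_{\text{true}}$ even though any single step only corrects one random direction.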
In some notes I am following for a class, it is claimed that if the learning rate is
$\eta_k = \frac{1}{k+1}$
then stochastic gradient descent will converge.
In fact, it claimed that for any positive learning-rate sequence $\eta_k$ satisfying
$\sum^{\infty}_{k=1} \eta_k^2 < \infty, \qquad \sum^{\infty}_{k=1} \eta_k = \infty$
the algorithm would converge (though the speed may vary). However, it was not clear to me why that is true.
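As a sanity check, $\eta_k = \frac{1}{k+1}$ does satisfy the usual Robbins–Monro conditions (square-summable but not summable):

$\sum^{\infty}_{k=1} \eta_k^2 = \sum^{\infty}_{k=1} \frac{1}{(k+1)^2} = \frac{\pi^2}{6} - 1 < \infty, \qquad \sum^{\infty}_{k=1} \eta_k = \sum^{\infty}_{k=1} \frac{1}{k+1} = \infty \ \text{(harmonic series)}.$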
Does anyone understand why this is the case? For batch gradient descent it seems intuitive why it would converge: you follow the negative gradient of the function you are trying to optimize, and if that function is convex, you should reach the minimum. What was not clear to me is why stochastic gradient descent cannot simply cycle and never converge, what role the learning rate plays in this convergence, and whether the order in which the points are chosen matters.
Thanks in advance!