I was reading these notes (if the previous link doesn't work, use this) on the subgradient method. They say that the step sizes (or step lengths) are determined before the algorithm is run, and I can't understand why this is done. What would happen if we used an adaptive scheme like the ones used in gradient descent, perhaps something like AdaGrad?
I'm trying to implement an SVM using a mini-batch stochastic gradient descent kind of method. While it does classify data points correctly, it doesn't quite maximize the slab thickness (the margin).
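To make the setup concrete, here is a minimal sketch of the kind of thing I mean by "mini-batch stochastic gradient descent on an SVM" (the function name, parameters, and constant step size are my own choices, not from the notes; the loss is the L2-regularized hinge loss):

```python
import numpy as np

def svm_minibatch_sgd(X, y, lam=0.01, alpha=0.1, epochs=50, batch=16, seed=0):
    """Mini-batch subgradient descent on (lam/2)*||w||^2 + mean hinge loss.
    Labels y must be in {-1, +1}. Uses a *constant* step size alpha,
    fixed before the run -- the question is whether this is the only option."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            sel = idx[start:start + batch]
            Xb, yb = X[sel], y[sel]
            margins = yb * (Xb @ w + b)
            viol = margins < 1  # points violating the margin contribute a subgradient
            # subgradient of the regularized hinge loss on this mini-batch
            gw = lam * w - (yb[viol, None] * Xb[viol]).sum(axis=0) / len(sel)
            gb = -yb[viol].sum() / len(sel)
            w -= alpha * gw
            b -= alpha * gb
    return w, b
```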
PS: If the link isn't accessible, the relevant information is reproduced below.
Step size rules
- Constant step size, $\alpha_k = \alpha$ is a positive constant, independent of $k$.
- Constant step length, $\alpha_k = \gamma/\|g^{(k)}\|_2$. This means that $\|x^{(k+1)} - x^{(k)}\|_2 = \gamma$.
Three more rules are listed: square summable but not summable, nonsummable diminishing, and nonsummable diminishing step lengths.
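To see what "determined before the algorithm is run" means in practice, here is a hypothetical sketch of these schedules as pure functions of the iteration counter $k$ (the specific constants and the example formulas for the last three rules are my own illustrative choices; the step-length rules also divide by the current subgradient norm, but the sequence of lengths itself is fixed in advance):

```python
import numpy as np

def step_size(k, rule, g_norm=1.0, alpha=0.1, gamma=0.1):
    """Classic subgradient step-size schedules; k is 1-indexed.
    Only k (and, for step-length rules, the subgradient norm) enters --
    no function values, no line search, no accumulated gradient statistics."""
    if rule == "constant_size":            # alpha_k = alpha
        return alpha
    if rule == "constant_length":          # ||x_{k+1} - x_k||_2 = gamma
        return gamma / g_norm
    if rule == "square_summable":          # e.g. alpha_k = alpha/k
        return alpha / k
    if rule == "nonsummable_diminishing":  # e.g. alpha_k = alpha/sqrt(k)
        return alpha / np.sqrt(k)
    if rule == "diminishing_length":       # gamma_k = gamma/sqrt(k), divided by ||g||
        return gamma / (np.sqrt(k) * g_norm)
    raise ValueError(f"unknown rule: {rule}")
```

Contrast this with AdaGrad, where the effective step size depends on the sum of squared gradients observed so far, i.e. on data computed during the run.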
The notes further say
"The most interesting feature of these choices is that they are determined before the algorithm is run; they do not depend on any data computed during the algorithm. This is very different from the step size rules found in standard descent methods, which very much depend on the current point and search direction."