stochastic subgradient descent


When using stochastic subgradient descent, the reported solution is $f_{best}(x_k)= \min \{f(x_1),f(x_2),\dots,f(x_k) \}$, i.e., the best objective value over all iterates so far. As I understand it, evaluating this requires computing $f$ on the whole dataset. However, the stochastic method is used precisely because evaluating the subgradient on the whole dataset is very slow.
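To make the $f_{best}$ definition concrete, here is a minimal sketch (not from the post) of subgradient descent on the hypothetical 1-D objective $f(x) = |x|$, tracking the running minimum over iterates:

```python
def subgrad_abs(x):
    # A subgradient of |x|: sign(x), choosing 0 at x = 0.
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def subgradient_descent(x0, steps=100):
    x = x0
    f_best = abs(x)  # f_best(x_1) = f(x_1)
    for k in range(1, steps + 1):
        x = x - (1.0 / k) * subgrad_abs(x)  # diminishing step size 1/k
        f_best = min(f_best, abs(x))        # running minimum over all iterates
    return f_best
```

Because subgradient methods are not descent methods (a step can increase the objective), it is the running minimum $f_{best}$, not the last iterate, that carries the convergence guarantee.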

My question is: what is the advantage of using stochastic subgradient descent rather than plain subgradient descent, if I have to use the whole dataset anyway?


The reason to use stochastic subgradient descent (and, more generally, stochastic gradient descent), even though you may still need to visit the entire dataset occasionally, is computational cost. In full subgradient descent, every single step requires a pass over the entire dataset, whereas a stochastic step computes the subgradient from only a small subset of the data. If your dataset is extremely large (as is increasingly the case in this field), then processing all the data for every step is simply too expensive for realistic hardware, and may not even fit in memory at once. Moreover, there are a number of tricks for parallelizing stochastic gradient descent that have no simple equivalent in full descent (for instance, running updates on different parts of the problem simultaneously and in parallel, then merging the partial results). Note also that you only evaluate $f_{best}$ on the full dataset when you want to report a solution, not at every step, so the expensive full pass happens rarely while the cheap stochastic steps do the bulk of the work.
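The cost argument above can be sketched in code. This is a hedged illustration (the objective, dataset, and all names are hypothetical, not from the answer): for the least-absolute-deviations objective $f(w) = \frac{1}{n}\sum_i |x_i w - y_i|$ with a scalar parameter $w$, a full subgradient step touches all $n$ samples, while a stochastic step touches only a minibatch of size $b$:

```python
import random

random.seed(0)
n = 10_000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]  # true slope is 2.0

def sample_subgrad(w, i):
    # Subgradient of |x_i * w - y_i| with respect to w.
    r = xs[i] * w - ys[i]
    return xs[i] * (1.0 if r > 0 else (-1.0 if r < 0 else 0.0))

def full_step(w, lr):
    # One full subgradient step: n data accesses.
    g = sum(sample_subgrad(w, i) for i in range(n)) / n
    return w - lr * g

def stochastic_step(w, lr, b=32):
    # One stochastic step: only b data accesses.
    batch = random.sample(range(n), b)
    g = sum(sample_subgrad(w, i) for i in batch) / b
    return w - lr * g
```

For the same data budget, one full step buys you $n/b$ stochastic steps; with $n = 10{,}000$ and $b = 32$, that is roughly 300 cheap steps per expensive one, which is why the stochastic variant makes far more progress per unit of computation early on.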