Why should we update all the variables simultaneously in Gradient Descent?


In the classic gradient descent algorithm, at each iteration step we update all the variables simultaneously, i.e. $$\pmb{\theta}' \gets \pmb{\theta}-\alpha\frac{\partial F}{\partial \pmb{\theta}}$$

One alternative is to update each variable within a step as soon as its new value is available.

For example, at each step: $$\theta_1' \gets \theta_1-\alpha\frac{\partial F(\theta_1,\theta_2)}{\partial \theta_1}$$ $$\theta_2' \gets \theta_2-\alpha\frac{\partial F(\theta_1',\theta_2)}{\partial \theta_2}$$ I'm sure this would also converge to a local optimum. So why is this alternate way of updating usually not the preferred one?
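To make the two schemes concrete, here is a minimal sketch contrasting them on a toy separable objective $F(\theta_1,\theta_2) = (\theta_1-1)^2 + (\theta_2+2)^2$; the objective, starting point, and step size are illustrative choices, not from the question:

```python
# Toy objective: F(t1, t2) = (t1 - 1)**2 + (t2 + 2)**2 (illustrative choice).
def dF_dt1(t1, t2):
    return 2 * (t1 - 1)

def dF_dt2(t1, t2):
    return 2 * (t2 + 2)

alpha = 0.1  # step size (illustrative)

def simultaneous_step(t1, t2):
    # Classic scheme: both partials are evaluated at the SAME point
    # before either variable is updated.
    g1, g2 = dF_dt1(t1, t2), dF_dt2(t1, t2)
    return t1 - alpha * g1, t2 - alpha * g2

def sequential_step(t1, t2):
    # "When available" scheme: t1 is updated first, and the partial
    # with respect to t2 is then evaluated at the NEW t1.
    t1 = t1 - alpha * dF_dt1(t1, t2)
    t2 = t2 - alpha * dF_dt2(t1, t2)
    return t1, t2

t = (5.0, 5.0)
s = (5.0, 5.0)
for _ in range(200):
    t = simultaneous_step(*t)
    s = sequential_step(*s)
# On this separable objective both schemes reach the minimizer (1, -2).
```

On a separable objective like this one the two schemes behave identically per coordinate; the differences the answers below discuss only show up when the partial derivatives share computation or when updates are parallelized.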

Edit: sometimes it makes sense not to update simultaneously. One use case is training neural networks in NLP: we usually use gradient descent there, but without simultaneous updating, because computing the update from all the training examples at once takes a lot of time. Refer to p. 33 of this PDF.


3 Answers

Accepted Answer

A simple example: let $f = \sin(\sum_{i=1}^n \alpha_i \theta_i)$. To compute all derivatives at a point you only have to evaluate the trigonometric factor (and the inner sum) once. If you cycle through the variables, updating as you go, you must evaluate it $n$ times, because the argument changes after each update. Most often it pays off to step in all coordinates at the same time. A simple analogy is walking: you typically don't walk in the east-west direction first and then north-south. You walk in the shortest direction, i.e., you move in both coordinates simultaneously.
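The cost argument can be checked by counting trig evaluations. Since $\partial f/\partial \theta_i = \alpha_i \cos(\sum_j \alpha_j \theta_j)$, one $\cos$ evaluation serves every coordinate in the simultaneous scheme, while the cyclic scheme needs a fresh one per coordinate. The coefficients, step size, and helper names below are illustrative, not from the answer:

```python
import math

a = [0.5, -1.0, 2.0]   # the alpha_i coefficients (illustrative values)
alpha = 0.01           # step size (illustrative)
cos_calls = 0          # counts trig evaluations

def tracked_cos(x):
    global cos_calls
    cos_calls += 1
    return math.cos(x)

def step_all_at_once(t):
    # One cos evaluation serves all n partial derivatives.
    c = tracked_cos(sum(ai * ti for ai, ti in zip(a, t)))
    return [ti - alpha * ai * c for ti, ai in zip(t, a)]

def step_cyclic(t):
    # Updating t_i immediately changes the argument of cos, so each
    # coordinate needs a fresh evaluation: n calls per sweep.
    t = list(t)
    for i in range(len(t)):
        s = sum(ai * ti for ai, ti in zip(a, t))
        t[i] -= alpha * a[i] * tracked_cos(s)
    return t

step_all_at_once([0.1, 0.2, 0.3])
n_simul = cos_calls              # 1 evaluation
step_cyclic([0.1, 0.2, 0.3])
n_cyclic = cos_calls - n_simul   # len(a) == 3 evaluations
```

For $n$ variables the gap is $1$ versus $n$ evaluations per pass, which is the answer's point.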

Answer

Practically speaking, the when-available method you describe would lock you into performing $\textbf{sequential}$ computation. Part of the reason gradient descent has become such a popular algorithm recently is that it can be computed in $\textbf{parallel}$.

Theoretically speaking, this wouldn't be as efficient. Imagine the values on which you perform gradient descent as a point in space (high-dimensional space, probably). If you wish to update the values as you determine the gradient of those same values, you are forced either 1) to recompute the gradient from scratch for each update, or 2) to use a perturbed gradient for an update, which would be slightly (or largely) incorrect. Remember that after that value-representing point moves in space, the gradients in its neighborhood will be slightly different.

Answer

Gradient descent is a very natural idea because you just step in the direction of steepest descent repeatedly. But sometimes computing the full gradient at each iteration is too expensive, so much research has gone into alternative methods such as asynchronous parallel coordinate descent, which may be like what you are suggesting. For example, "An Asynchronous Parallel Stochastic Coordinate Descent Algorithm" by Liu et al. came up when I googled for this. See also the presentation "The Revival of Coordinate Descent Methods" (link to PDF) by Wright.
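A minimal (serial) coordinate-descent sketch shows the basic shape of this family of methods: update one randomly chosen coordinate per iteration instead of the full gradient. The objective, step size, and iteration count are illustrative choices, not from the cited papers, and real asynchronous variants distribute these updates across workers:

```python
import random

def grad_i(t, i):
    # Partial derivative of F(t) = sum_j (t_j - j)**2 with respect to t_i
    # (a toy separable objective chosen for illustration).
    return 2 * (t[i] - i)

random.seed(0)
t = [10.0, 10.0, 10.0]
for _ in range(1000):
    i = random.randrange(len(t))   # pick one coordinate at random
    t[i] -= 0.1 * grad_i(t, i)     # update only that coordinate
# t approaches the minimizer [0, 1, 2]
```

Each iteration touches a single coordinate, which is what makes the updates cheap and easy to run asynchronously in parallel.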