In gradient descent, we calculate the negative gradient of a function and update the inputs in that direction. This is repeated until the gradient becomes very small, indicating that we are close to a local minimum.
How far we move in the direction of the negative gradient is determined by the step size, a parameter that must be chosen carefully.
In cases where calculating the gradient is expensive, but evaluating the function is not:
Does it make sense to keep moving in the direction of a single calculated gradient, one step at a time, until the output starts increasing?
At that point we could go back (half?) a step and calculate the next gradient, i.e. propagate $n_{\text{steps}} \cdot \text{stepsize} \cdot \text{gradient}$ backwards.
This would assume taking a relatively small step size.
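To make the proposal concrete, here is a minimal sketch of the idea described above. It assumes a cheap objective `f` and an expensive gradient `grad_f`; the function names, the normalization of the direction, and the "back off half a step" rule are illustrative choices, not an established method.

```python
import numpy as np

def march_then_backtrack(f, grad_f, x, step_size=0.1, max_marches=50):
    """Reuse one expensive gradient for several cheap function evaluations.

    Keeps stepping along the negative gradient while f decreases; when f
    starts increasing, backs off half a step (the questioner's heuristic).
    """
    g = grad_f(x)                              # expensive: computed once
    d = -g / (np.linalg.norm(g) + 1e-12)       # unit descent direction
    best = f(x)
    steps = 0
    while steps < max_marches:
        trial = x + step_size * d
        val = f(trial)                         # cheap evaluation
        if val >= best:                        # output started increasing:
            x = x + 0.5 * step_size * d        # go back half a step from trial
            break
        x, best = trial, val
        steps += 1
    return x, steps

# Toy quadratic with minimum at (1, -2).
target = np.array([1.0, -2.0])
f = lambda x: float(np.sum((x - target) ** 2))
grad_f = lambda x: 2.0 * (x - target)
x_final, n_steps = march_then_backtrack(f, grad_f, np.zeros(2))
```

On this toy problem the loop marches toward the minimum using a single gradient call, trading extra function evaluations for fewer gradient evaluations.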
Plausible reasons why this might not be done:
- It is already done successfully.
- Computing the forward pass is not cheaper than re-evaluating the gradient.
- There is some incompatibility with backpropagation. Reason?
- In high dimensions, something subtle happens.
- Some other practical consideration.
If it is one of those, please explain what exactly happens.
I am mainly interested in applications in neural networks, and if a method like this is being used there, or if there are reasons not to.
As others have mentioned, what you're suggesting is essentially called line search (see also backtracking line search). I suspect that line search is not ubiquitous in machine learning because stochastic descent methods (which take one step per individual training datum per epoch) have better generalization performance, and it would be counterproductive to spend time doing a line search on each of those single-datum steps; it would be time spent pointlessly "fine-tuning a heuristic." (Note another reason single-datum or small-batch steps are preferred is that large batches, such as the full data set, are often too large to practically work on all at once.)
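For reference, backtracking line search in its standard (Armijo) textbook form can be sketched as follows; this is a generic illustration, not tied to any particular library, and the parameter defaults are conventional choices.

```python
import numpy as np

def backtracking_line_search(f, grad, x, d, alpha=1.0, beta=0.5, c=1e-4):
    """Shrink the step alpha until the Armijo sufficient-decrease condition
    f(x + alpha*d) <= f(x) + c*alpha*<grad(x), d> holds for direction d."""
    fx, gx = f(x), grad(x)
    slope = np.dot(gx, d)      # directional derivative; negative for descent d
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= beta          # backtrack: try a smaller step
    return alpha

# Example on f(x) = ||x||^2 with the steepest-descent direction d = -grad.
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2.0 * x
x = np.array([1.0, 1.0])
d = -grad(x)
alpha = backtracking_line_search(f, grad, x, d)  # returns 0.5 here
```

Each iteration of the `while` loop costs one extra function evaluation, which is exactly the overhead the answer argues is wasted on noisy single-datum steps.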