In gradient descent, we calculate the negative gradient of a function and update the inputs in that direction. This is repeated until the gradient becomes very small, indicating that we are close to a local minimum.
How far we move in the direction of the negative gradient is determined by the step size, a parameter that must be chosen carefully.
In cases where calculating the gradient is expensive, but evaluating the function is not:
Does it make sense to keep moving in the direction of a single calculated gradient, one step at a time, until the output starts increasing?
At that point we could go back (half?) a step and calculate the next gradient, i.e. propagate $n_{\text{steps}} \cdot \text{stepsize} \cdot \text{gradient}$ backwards.
This would assume taking a relatively small step size.
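To make the proposal concrete, here is a minimal sketch of the idea described above. It assumes a cheap objective `f` and an expensive gradient `grad_f`; the function names, the normalization of the direction, and the "back off half a step" rule are illustrative choices, not an established method.

```python
import numpy as np

def march_then_backtrack(f, grad_f, x, step_size=0.1, max_marches=50):
    """Reuse one expensive gradient for several cheap function evaluations.

    Keeps stepping along the negative gradient while f decreases; when f
    starts increasing, backs off half a step (the questioner's heuristic).
    """
    g = grad_f(x)                              # expensive: computed once
    d = -g / (np.linalg.norm(g) + 1e-12)       # unit descent direction
    best = f(x)
    steps = 0
    while steps < max_marches:
        trial = x + step_size * d
        val = f(trial)                         # cheap evaluation
        if val >= best:                        # output started increasing:
            x = x + 0.5 * step_size * d        # go back half a step from trial
            break
        x, best = trial, val
        steps += 1
    return x, steps

# Toy quadratic with minimum at (1, -2).
target = np.array([1.0, -2.0])
f = lambda x: float(np.sum((x - target) ** 2))
grad_f = lambda x: 2.0 * (x - target)
x_final, n_steps = march_then_backtrack(f, grad_f, np.zeros(2))
```

On this toy problem the loop marches toward the minimum using a single gradient call, trading extra function evaluations for fewer gradient evaluations.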
Plausible reasons why this might not be done:
- It is already done successfully.
- Computing the forward pass is not cheaper than re-evaluating the gradient.
- There is some incompatibility with backpropagation. Reason?
- In high dimensions, something subtle happens.
- Some other practical consideration.
If it is one of those, please explain what exactly happens.
I am mainly interested in applications in neural networks, and if a method like this is being used there, or if there are reasons not to.
As others have mentioned, what you're suggesting is essentially called line search (see also backtracking line search). I suspect that line search is not ubiquitous in machine learning because stochastic descent methods (which take one step per individual training datum per epoch) have better generalization performance, and it would be counterproductive to spend time doing a line search on each of those single-datum steps; it would be time spent pointlessly "fine-tuning a heuristic." (Note another reason single-datum or small-batch steps are preferred is that large batches, such as the full data set, are often too large to practically work on all at once.)
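For reference, backtracking line search in its standard (Armijo) textbook form can be sketched as follows; this is a generic illustration, not tied to any particular library, and the parameter defaults are conventional choices.

```python
import numpy as np

def backtracking_line_search(f, grad, x, d, alpha=1.0, beta=0.5, c=1e-4):
    """Shrink the step alpha until the Armijo sufficient-decrease condition
    f(x + alpha*d) <= f(x) + c*alpha*<grad(x), d> holds for direction d."""
    fx, gx = f(x), grad(x)
    slope = np.dot(gx, d)      # directional derivative; negative for descent d
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= beta          # backtrack: try a smaller step
    return alpha

# Example on f(x) = ||x||^2 with the steepest-descent direction d = -grad.
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2.0 * x
x = np.array([1.0, 1.0])
d = -grad(x)
alpha = backtracking_line_search(f, grad, x, d)  # returns 0.5 here
```

Each iteration of the `while` loop costs one extra function evaluation, which is exactly the overhead the answer argues is wasted on noisy single-datum steps.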