I hear a lot about line search in optimization.
I am fine with methods like Gradient Descent, but what is line search and what are its uses?
Intuitively, from what I understand right now, line search basically tries a set of values of a parameter along a line to see which one gives the best value of the function, right?
Basically, a properly designed line search guarantees the convergence of an algorithm to a locally optimal point. This process is called the globalization of an algorithm, which, to be fair, is kind of a bad name since it has nothing to do with finding the globally optimal point. In any case, what we normally require is that the step length give a certain amount of decrease in the objective, which we call the sufficient decrease condition, and that the slope along the search direction at the new point be sufficiently reduced, which we call the curvature condition. One well-known pairing of these conditions is the strong Wolfe conditions.

As it turns out, there are other ways to guarantee convergence, such as trust-region algorithms. If you're curious how convergence is proved for both line-search and trust-region methods, Nocedal and Wright's book Numerical Optimization has the basic proofs.
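To make the two conditions concrete, here is a minimal sketch of a check for the strong Wolfe conditions at a trial step length. The function names (`satisfies_strong_wolfe`, `f`, `grad`) and the constants `c1`, `c2` are illustrative choices, not from any particular library; the two inequalities themselves are the standard sufficient-decrease and curvature conditions.

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for step length alpha
    along search direction p from the point x."""
    phi0 = f(x)                      # objective at the current point
    dphi0 = grad(x) @ p              # directional derivative at alpha = 0
    phi = f(x + alpha * p)           # objective at the trial point
    dphi = grad(x + alpha * p) @ p   # directional derivative at the trial point

    # Sufficient decrease (Armijo) condition
    sufficient_decrease = phi <= phi0 + c1 * alpha * dphi0
    # Strong curvature condition: slope along p has flattened enough
    curvature = abs(dphi) <= c2 * abs(dphi0)
    return sufficient_decrease and curvature

# Toy example: quadratic f(x) = x^T x with a steepest-descent direction.
f = lambda x: x @ x
grad = lambda x: 2.0 * x
x = np.array([1.0, 1.0])
p = -grad(x)
print(satisfies_strong_wolfe(f, grad, x, p, alpha=0.5))  # exact minimizer along p -> True
print(satisfies_strong_wolfe(f, grad, x, p, alpha=2.0))  # overshoots -> False
```

A practical line search would search over `alpha` until both conditions hold; this snippet only tests a given step.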
In machine learning, the learning rate basically does what the line search does by adjusting how far we take a step in our chosen direction. However, not all machine learning algorithms check the strong Wolfe conditions, so they are not guaranteed to converge. Furthermore, from a practical point of view, a decent line search will improve the performance of the algorithm beyond just theoretical guarantees. A fixed learning rate, or one that is decreased monotonically according to a specified schedule, frankly, performs poorly. Even something as simple as a golden-section search will improve things dramatically.

Part of the reason for this is that computing a gradient tends to be expensive whereas computing the objective tends to be cheap. As such, if we use a line search that depends purely on objective evaluations or directional derivatives, which are cheaper than full gradients to compute, we can better utilize our expensive computation by augmenting it with cheaper ones. That, and the guarantee of convergence.