How "Optimal" are Solutions from Gradient Descent?


When optimizing the Loss Functions of Neural Networks using (some version of) the Gradient Descent algorithm, I have often heard this situation described as a Sequential Optimization Problem. This means that because it is usually impossible (i.e. too computationally expensive) to optimize the Loss Function by jointly solving a system of Non-Linear equations, we instead "sequentially" find better and better solutions until some stopping criterion is met (e.g. an error threshold, or a maximum number of iterations).
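To make concrete what I mean by "sequential", here is a minimal sketch of gradient descent on a simple quadratic loss, stopped by either an error threshold or an iteration budget (the data, step size, and thresholds below are just illustrative assumptions):

```python
import numpy as np

# Toy least-squares problem: L(w) = ||Xw - y||^2 (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)      # initial guess
lr = 0.001           # step size (assumed small enough to converge)
for step in range(10_000):               # iteration-budget criterion
    grad = 2 * X.T @ (X @ w - y)         # gradient of the loss
    w -= lr * grad                       # one "sequential" improvement
    if np.linalg.norm(grad) < 1e-6:      # error-threshold criterion
        break

print(np.round(w, 3))
```

Each pass of the loop only improves on the previous iterate; nothing is solved jointly.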

I am interested in knowing if this type of optimization problem (i.e. Sequential Optimization of Loss Functions) has ever been studied through the lens of Bellman Optimality. For instance, do we have any reasons to believe that such optimization problems might have "Optimal Substructure and Overlapping Subproblems"?
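For reference, here is the kind of "Optimal Substructure and Overlapping Subproblems" I have in mind, in the classic Bellman setting of shortest paths: the optimal cost from a node is built from the optimal costs of its successors, and those subproblems recur, so they can be memoized (the toy graph below is purely an illustrative assumption; I am asking whether anything analogous holds for Loss Function optimization):

```python
from functools import lru_cache

# Toy DAG (illustrative): node -> {successor: edge cost}
edges = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},  # goal
}

@lru_cache(maxsize=None)  # memoization exploits overlapping subproblems
def cost_to_goal(node):
    # Bellman equation: V(n) = min over successors s of [c(n, s) + V(s)]
    if node == "D":
        return 0
    return min(c + cost_to_goal(s) for s, c in edges[node].items())

print(cost_to_goal("A"))  # → 3, via A -> B -> C -> D
```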

I am interested in this for the following reason: do we have any theoretical grounds to believe that sequential optimization of Loss Functions has any "chance" of finding some type of Optimal Solution - or do we just assume this and hope for the best (i.e. settle for a "good enough" solution)?

  • Can sequentially solving these kinds of optimization problems, in theory, lead to some optimal solution?

Thanks!