Efficient Numerical Optimization for Gradient Descent with Constraints (Lagrangian Multiplier)


I have rummaged through the internet, the StackExchange sites, and well-known books on numerical optimization (e.g. [1],[2],[3],[4],[5],[6],[7],[8]) looking for an answer, yet I still feel a bit fuzzy about my optimization problem:

\begin{equation*} \begin{aligned} & \underset{x}{\text{minimize}} & & f(x) \\ & \text{subject to} & & g_i(x) \leq d_i, \; i = 1, \ldots, n. \end{aligned} \end{equation*}

where $x=(x_1,x_2,\dots,x_m)$.

I'm trying to solve it using the method of Lagrange multipliers where my Lagrangian is as follows ($\lambda=(\lambda_1,\lambda_2,\dots,\lambda_n)$):

\begin{equation*} \begin{aligned} L(x,\lambda)=f(x)+ \sum_{i=1}^n \lambda_i(g_i(x)-d_i) \end{aligned} \end{equation*}
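
To make this concrete, here is a minimal sketch of the Lagrangian and its gradient for a tiny hypothetical instance (a two-variable quadratic objective with a single linear constraint; the functions and numbers are illustrative assumptions, not the actual problem):

```python
import numpy as np

# Hypothetical toy instance: f(x) = (x1-2)^2 + (x2-2)^2,
# one constraint g(x) = x1 + x2 <= d with d = 1 (active at the solution).
d = 1.0

def f(x):
    return (x[0] - 2)**2 + (x[1] - 2)**2

def g(x):
    return x[0] + x[1]

def lagrangian(x, lam):
    # L(x, lam) = f(x) + lam * (g(x) - d)
    return f(x) + lam * (g(x) - d)

def grad_lagrangian(x, lam):
    # Gradient w.r.t. (x1, x2, lam): [2(x1-2) + lam, 2(x2-2) + lam, g(x) - d]
    return np.array([2*(x[0] - 2) + lam,
                     2*(x[1] - 2) + lam,
                     g(x) - d])
```

For this toy instance the gradient vanishes at $x=(0.5,0.5)$, $\lambda=3$, which is the constrained minimizer.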

I see three major ways to solve it:

  1. Direct application of gradient descent on $L$ to hopefully find a critical point (i.e. a saddle point or a minimum).
  2. Using the Newton-Raphson or Broyden method to solve the non-linear system of equations $\nabla_{x,\lambda}L(x,\lambda)=0$.
  3. Using gradient descent to find a local minimum of the squared magnitude of the gradient of the Lagrangian, $h(x,\lambda) = \|\nabla_{x,\lambda}L(x,\lambda)\|^2=\sum_{i=1}^m \left(\frac{\partial L}{\partial x_i}\right)^2 + \sum_{j=1}^n \left(\frac{\partial L}{\partial \lambda_j}\right)^2$.
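
As a sketch of the second option, Newton's method applied to $\nabla_{x,\lambda}L = 0$ on a tiny hypothetical instance (quadratic objective, one linear constraint; here the KKT matrix is constant, so Newton converges in a single step — the real problem would need the Jacobian recomputed per iterate):

```python
import numpy as np

# Toy instance (assumption, not the actual problem):
# f(x) = (x1-2)^2 + (x2-2)^2, constraint g(x) = x1 + x2 <= d, d = 1.
d = 1.0

def grad_L(z):
    x1, x2, lam = z
    return np.array([2*(x1 - 2) + lam,
                     2*(x2 - 2) + lam,
                     x1 + x2 - d])

def jac_grad_L(z):
    # Jacobian of grad_L (the KKT matrix); constant for this quadratic toy problem
    return np.array([[2.0, 0.0, 1.0],
                     [0.0, 2.0, 1.0],
                     [1.0, 1.0, 0.0]])

z = np.zeros(3)                       # initial guess for (x1, x2, lam)
for _ in range(20):                   # Newton iterations on grad_L = 0
    z -= np.linalg.solve(jac_grad_L(z), grad_L(z))
    if np.linalg.norm(grad_L(z)) < 1e-12:
        break
```

For a problem with $m+n\approx 10^6$, forming and factoring this Jacobian densely is out of the question; a quasi-Newton or Krylov-based solver with Jacobian-vector products would be the realistic variant.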

Issues with the aforementioned methods:

  1. The critical points of a Lagrangian occur at saddle points rather than at local minima (or maxima). Unfortunately, many numerical optimization techniques, such as hill climbing and gradient descent, are designed to find local minima (or maxima), not saddle points. Stochastic variants of gradient descent even tend to escape saddle points, and, interestingly, [7] has shown that gradient descent almost surely ends up at a minimum rather than a saddle point. Still, I think that if I strictly stick with driving $\nabla L(x,\lambda)$ to zero and do not use a noisy gradient descent, saddle points should also be captured, especially since the number of saddle points grows with dimension ($m+n \approx 10^6$). This method might therefore be faster and computationally cheaper than the other two, but $L$ looks fairly non-smooth, so reaching a saddle point might take a long time, and gradient descent might never converge, searching for a minimum forever. I'm not sure, though.
  2. This seems like overkill, as a local minimum or even a saddle point is an acceptable solution for me.
  3. This assures me that I end up at a local minimum of $h$ (which would be a critical point of $L$). It requires, however, computing the gradient of the squared magnitude of the gradient, i.e. second derivatives of $L$, which do not look easy to obtain analytically. On the other hand, one may use finite differences to compute such gradients.
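
The finite-difference route in option 3 can be sketched on a tiny hypothetical instance (quadratic objective, one linear constraint; everything here is an illustrative assumption). Central differences need one pair of $h$-evaluations per coordinate, which already hints at the cost for $m+n\approx 10^6$:

```python
import numpy as np

# Toy instance: f(x) = (x1-2)^2 + (x2-2)^2, g(x) = x1 + x2 <= d, d = 1.
d = 1.0

def grad_L(z):
    x1, x2, lam = z
    return np.array([2*(x1 - 2) + lam,
                     2*(x2 - 2) + lam,
                     x1 + x2 - d])

def h(z):
    gL = grad_L(z)
    return gL @ gL                    # h = squared magnitude of grad L

def grad_h_fd(z, eps=1e-6):
    # Central finite differences: 2 evaluations of h per coordinate of (x, lam)
    out = np.empty_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        out[i] = (h(z + e) - h(z - e)) / (2 * eps)
    return out

z = np.zeros(3)                       # start at (x, lam) = 0
for _ in range(5000):                 # plain gradient descent on h
    z -= 0.05 * grad_h_fd(z)
# h decreases toward 0, so z approaches a critical point of L
```

Note that the step size and iteration count are tuned to this toy problem; at $10^6$ variables each descent step would cost millions of $h$-evaluations, which is the scalability concern raised above.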

While the first method seems computationally cheaper than the third, it might be risky: if it works, the abundance of saddle points may let it finish quickly, but it may equally suffer from the non-smoothness of $L$. The third, in turn, might not be appropriate for the size of my problem. Which method is, firstly, applicable and, secondly, more efficient for my problem? (Quasi-Newton methods, Powell's method, and conjugate gradient are also under consideration.)


In case you are wondering, my objective function and constraints are as follows (with $a_j, b_j, c_j > 0$):

$$f(x)=\sum_{j=1}^m\frac{a_j}{x_j} \\ g_i(x)=\frac{\sum_{j=k_i}^{m_i}\frac{b_j}{x_j}}{\sum_{j=k_i}^{m_i}\frac{c_j}{x_j}}$$

Also, $m \approx 10^6$ and $n\approx 10^5$.
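
For reference, the objective and one constraint block can be evaluated as below; the problem size, coefficient ranges, and index window $k_i,\dots,m_i$ are illustrative assumptions stood in for the real data:

```python
import numpy as np

# Synthetic positive coefficients (assumption: uniform in (0.5, 1.5));
# m = 1000 stands in for the real m ~ 1e6.
rng = np.random.default_rng(0)
m = 1000
a = rng.uniform(0.5, 1.5, m)
b = rng.uniform(0.5, 1.5, m)
c = rng.uniform(0.5, 1.5, m)

def f(x):
    # f(x) = sum_j a_j / x_j
    return np.sum(a / x)

def g(x, k, mi):
    # g_i(x) = (sum_{j=k_i}^{m_i} b_j/x_j) / (sum_{j=k_i}^{m_i} c_j/x_j)
    s = slice(k, mi + 1)              # constraint i covers indices k_i..m_i
    return np.sum(b[s] / x[s]) / np.sum(c[s] / x[s])

x = np.ones(m)                        # a feasible-looking positive trial point
val_f = f(x)
val_g = g(x, 0, 99)                   # one hypothetical constraint window
```

Both $f$ and each $g_i$ are smooth on the positive orthant $x_j > 0$, so the non-smoothness concern above applies only if the iterates are allowed to approach $x_j = 0$.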