Equivalence of the method of multipliers and the proximal point method


Recently I have been reading the papers On the convergence of the exponential multiplier method for convex programming by Paul Tseng and Dimitri P. Bertsekas, and Penalty/Barrier Multiplier Methods for Convex Programming Problems by Aharon Ben-Tal and Michael Zibulevsky. In both papers, the authors claim that the method of multipliers and the proximal point method are mathematically equivalent. However, after carefully reading their proofs, I am still confused about the result.

Let me first write down the problem and methods for convenience:

Convex programming

Consider the following problem: \begin{equation} \label{eq:primal} \begin{aligned} \min\hspace{1em} &f(x)\\ \text{s.t.}\hspace{1em} &g_j(x) \leq 0\hspace{1em} \forall j \in [m] \end{aligned} \end{equation} where $f, g_j: R^n\rightarrow R$ are closed proper convex functions. The Lagrangian of this problem is: \begin{equation} L(x, u) = f(x) + \sum_{j=1}^{m} u_j g_j(x). \end{equation} The dual function is $d(u) = \min_x L(x,u)$, and the dual problem is \begin{equation*} \begin{aligned} \max\hspace{.5em} &d(u), \hspace{.5em} u\geq 0. \end{aligned} \end{equation*}
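To keep a concrete instance in mind (a toy example of my own, not taken from the papers): take $n = m = 1$, $f(x) = x^2$ and $g_1(x) = 1 - x$. Then \begin{equation*} L(x, u) = x^2 + u(1-x), \qquad \arg\min_x L(x, u) = \frac{u}{2}, \qquad d(u) = u - \frac{u^2}{4}, \end{equation*} so the dual optimum is $u^* = 2$, attained together with the primal optimum $x^* = 1$.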

The method of multipliers

The method of multipliers (or penalty/barrier multiplier method, augmented Lagrangian method) solves the convex program by introducing a proper penalty function $\varphi$ (in the classical augmented Lagrangian method, it is just a quadratic penalty function), such that the constraints $g_j(x)\leq 0$ are equivalent to: \begin{equation} p_j\varphi(g_j(x)/p_j)\leq 0, \end{equation} where $p_j>0$ is the penalty parameter. Now the original problem is equivalent to: \begin{equation*} \begin{aligned} \min\hspace{1em} &f(x)\\ \text{s.t.}\hspace{1em} &p_j\varphi(g_j(x)/p_j) \leq 0\hspace{1em} \forall j \in [m] \end{aligned} \end{equation*} and we have the (non-quadratic) augmented Lagrangian: \begin{equation*} L(x, u, p) = f(x) + \sum_{j=1}^{m} u_j p_j \varphi(g_j(x) / p_j). \end{equation*} In the $(k+1)$-th iteration, we minimize $L(x,u^{(k)},p^{(k)})$ w.r.t. $x$: \begin{equation} x^{(k+1)} \gets \arg\min_x L(x,u^{(k)},p^{(k)}), \end{equation} and the multipliers $u^{(k)}_j$ are also updated: \begin{equation} u^{(k+1)}_j \gets u^{(k)}_j\cdot \nabla\varphi(g_j(x^{(k+1)})/p^{(k)}_j). \end{equation} (The penalty parameters can also be updated, but that is not relevant to our discussion.)
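To make the iteration concrete, here is a small numerical sketch (my own illustration, not from the papers), using the exponential penalty $\varphi(t) = e^t - 1$ on the toy problem $\min x^2$ s.t. $1 - x \leq 0$; the inner minimization is done by ternary search, which is valid since the augmented Lagrangian is convex in $x$:

```python
import math

def phi(t):
    # exponential penalty: phi(t) = e^t - 1, with phi'(t) = e^t
    return math.exp(t) - 1.0

def argmin_1d(h, lo=-10.0, hi=10.0, iters=200):
    # ternary search for the minimizer of a 1-D convex function
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if h(m1) < h(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# toy problem: min x^2  s.t.  1 - x <= 0   (optimum x* = 1, u* = 2)
f = lambda x: x * x
g = lambda x: 1.0 - x

u, p = 1.0, 1.0
for _ in range(50):
    # x^{(k+1)} <- argmin_x L(x, u^{(k)}, p^{(k)})
    L_aug = lambda x: f(x) + u * p * phi(g(x) / p)
    x = argmin_1d(L_aug)
    # u^{(k+1)} <- u^{(k)} * phi'(g(x^{(k+1)}) / p^{(k)}),  phi' = exp
    u = u * math.exp(g(x) / p)

print(round(x, 4), round(u, 4))  # iterates approach x* = 1, u* = 2
```

On this problem the iterates converge to the primal-dual optimum $x^* = 1$, $u^* = 2$ with a fixed penalty parameter.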

An intuition of the choice of $u^{(k+1)}$ is that, by such a choice, the minimizer $x^{(k+1)}$ to the augmented Lagrangian $L(x, u^{(k)}, p^{(k)})$ is also a minimizer to the Lagrangian $L(x, u^{(k+1)})$, i.e. \begin{equation} x^{(k+1)} = \arg\min_x L(x, u^{(k+1)}). \end{equation}

The proximal point method

The proximal point method solves the dual problem by introducing a proximal term (in the classical setting, the proximal term is just the squared $\ell_2$ norm): \begin{equation} u^{(k+1)}\gets \arg\max_{u\geq 0} d(u) - D(u, u^{(k)}). \end{equation} The proximal term $D(u, u^{(k)})$ is a (separable) distance function defined as: \begin{equation} D(u, {u}^{(k)}) = \sum_j {p}^{(k)}_j {u}^{(k)}_j \varphi^{*}(u_j / {u}^{(k)}_j), \end{equation} where $\varphi^*$ is the conjugate function of $\varphi$.
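The same toy problem gives a concrete sketch of the dual proximal iteration (again my own illustration, not from the papers). For the exponential penalty $\varphi(t) = e^t - 1$, the conjugate is $\varphi^*(s) = s\ln s - s + 1$ for $s > 0$, and the dual of $\min x^2$ s.t. $1 - x \leq 0$ is available in closed form as $d(u) = u - u^2/4$:

```python
import math

def argmax_1d(h, lo, hi, iters=200):
    # ternary search for the maximizer of a 1-D concave function
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if h(m1) > h(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def phi_star(s):
    # conjugate of phi(t) = e^t - 1:  phi*(s) = s ln s - s + 1  (s > 0)
    return s * math.log(s) - s + 1.0

# dual of  min x^2 s.t. 1 - x <= 0:  d(u) = u - u^2/4, maximized at u* = 2
d = lambda u: u - u * u / 4.0

u, p = 1.0, 1.0
for _ in range(50):
    # u^{(k+1)} <- argmax_{u >= 0}  d(u) - p u^{(k)} phi*(u / u^{(k)})
    prox_obj = lambda v, uk=u: d(v) - p * uk * phi_star(v / uk)
    u = argmax_1d(prox_obj, 1e-9, 10.0)

print(round(u, 4))  # iterates approach u* = 2
```

Since $d$ is concave and $\varphi^*$ is convex, each proximal subproblem is concave in $u$, so ternary search applies.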

The equivalence of these two methods

In both papers, the authors showed the equivalence of these two methods in such a way:

  1. Since $x^{(k+1)} =\arg\min_x L(x, u^{(k+1)})$, we can show that: \begin{equation} \label{eq:subgrad_dual} \left(\begin{array}{c} g_1({x}^{(k+1)})\\ \vdots\\ g_m({x}^{(k+1)}) \end{array}\right) \in \partial d({u}^{(k+1)}). \end{equation}

  2. By the choice of $u^{(k+1)}$ in the method of multipliers, we see that: \begin{equation*} g_j({x}^{(k+1)}) = {p}^{(k)}_j \nabla\varphi^* ({u}^{(k+1)}_j / {u}^{(k)}_j), \end{equation*} which is exactly the $j$-th component of the gradient of the distance function, i.e. \begin{equation} \label{eq:subgard_prox} \left(\begin{array}{c} g_1({x}^{(k+1)})\\ \vdots\\ g_m({x}^{(k+1)}) \end{array}\right) \in \partial D({u}^{(k+1)}, {u}^{(k)}). \end{equation}

Combining these two, we see that \begin{equation} 0\in \partial d(u^{(k+1)}) - \partial D({u}^{(k+1)}, {u}^{(k)}). \end{equation} This shows that $u^{(k+1)}$ in the method of multipliers is also the optimum of the proximal problem.
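This claimed equivalence can be checked numerically on a toy problem ($\min x^2$ s.t. $1 - x \leq 0$, exponential penalty $\varphi(t) = e^t - 1$, $\varphi^*(s) = s\ln s - s + 1$, dual $d(u) = u - u^2/4$): a single multiplier-method step and a single proximal step started from the same $u^{(k)}$ should produce the same $u^{(k+1)}$ (my own sanity check, not from the papers):

```python
import math

def argopt_1d(h, lo, hi, maximize=False, iters=200):
    # ternary search for the extremum of a 1-D convex/concave function
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        better = h(m1) > h(m2) if maximize else h(m1) < h(m2)
        if better:
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

f = lambda x: x * x
g = lambda x: 1.0 - x
d = lambda u: u - u * u / 4.0
phi_star = lambda s: s * math.log(s) - s + 1.0

u0, p = 1.0, 1.0

# one step of the method of multipliers from u0
x1 = argopt_1d(lambda x: f(x) + u0 * p * (math.exp(g(x) / p) - 1.0), -10.0, 10.0)
u_mult = u0 * math.exp(g(x1) / p)

# one step of the proximal point method on the dual from the same u0
u_prox = argopt_1d(lambda v: d(v) - p * u0 * phi_star(v / u0), 1e-9, 10.0,
                   maximize=True)

print(round(u_mult, 6), round(u_prox, 6))  # the two updates coincide
```

In this example both updates solve the same scalar equation $\ln u = 1 - u/2$ (for $u^{(k)} = 1$, $p = 1$), which is why they agree to numerical precision.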

However, I think there might be some issues with this argument:

  1. Though the update of $u^{(k)}$ looks the same in both methods, the sequences $\{u^{(k)}\}$ generated by the two methods are not the same. My reason is that, in the method of multipliers, $x^{(k+1)}$ is the minimizer of $L(x, u^{(k)}, p^{(k)})$ (and hence also of $L(x, u^{(k+1)})$), whereas in the proximal point method $x^{(k+1)}$ should be the minimizer of $L(x, u^{(k)})$, which is a different problem, so we cannot conflate the two $x^{(k+1)}$'s.

  2. To show that the vector of $g_j(x^{(k+1)})$ is a subgradient of $d(u^{(k+1)})$ (Point 1 in the above sketch), it is required that $x^{(k+1)}=\arg\min_x L(x, u^{(k+1)})$, which implies that $x^{(k+1)}$ should be computed after $u^{(k+1)}$ in the proximal point method. However, to show that the vector of $g_j(x^{(k+1)})$ is a subgradient of $D(u^{(k+1)}, u^{(k)})$ (Point 2), it is required that $u^{(k+1)}$ is chosen by the updating rule of the method of multipliers, i.e. computed from $g_j(x^{(k+1)})$, which implies that $x^{(k+1)}$ should come first. This contradicts Point 1.

  3. My intuition behind this is that the proximal point method is doing a kind of mirror "ascent" (or gradient ascent in the classical setting) on the dual problem. However, if we know that $g_j(x^{(k+1)})$ is a subgradient of $d(u)$ at the point $u^{(k+1)}$, that subgradient should be used to compute the next point $u^{(k+2)}$, not $u^{(k+1)}$ itself as in the method of multipliers (which seems counter-intuitive, since we would be computing a point using the subgradient at that very point).


I am not an expert in optimization and I am still learning all these things. Please point out any mistakes I have made; I would appreciate any comments!

Answer
  1. You are right that the sequences $\{u^{(k)}\}$ generated from the method of multipliers and the proximal point method are not the same in general, since they are generated from different minimization problems. However, the key observation here is that the two methods share a common fixed point. This means that, although the sequences generated by the two methods are different, they both converge to the same solution, provided that the assumptions required for convergence hold. So, it is not required that the sequences themselves be the same, but rather that their limits are the same.

  2. Indeed, there seems to be a confusion in the order of operations. However, the important thing is to show that the updates of $x^{(k+1)}$ and $u^{(k+1)}$ in both methods lead to a common fixed point. The exact order of updates might differ, but the main idea is that these update rules have the same effect, which is to drive the algorithm to the optimal solution. So, although the order of operations might be different, the behavior of the two methods is the same in terms of their convergence properties.

  3. Your intuition about the proximal point method being a mirror ascent for the dual problem is correct. However, the purpose of showing the equivalence of the two methods is not to claim that they are identical in every step, but rather to show that they share the same convergence properties and reach the same fixed point. The subgradient update rule for $u^{(k+1)}$ in the method of multipliers might look counter-intuitive when compared to the proximal point method, but the important thing to note is that both methods are driving the algorithm towards the optimal solution in their respective ways.