Dual residual for linearized ADMM


I am using linearized ADMM for a problem with a (non-smooth) convex loss function $f(x)$, and a hard constraint $x \in E$, where $E$ is an ellipsoid in $R^d$. I have encoded the hard constraint as $A x \in B_d(1)$, where $B_d(1)$ is the unit ball in $R^d$ and $A$ is the Cholesky factor of the quadratic form $Q$ that expresses $E$ (i.e., $E$ is the set $x:\, x' Q x \leq 1$, and $A' A = Q$).

Using linearized ADMM I express the constraint as $g(Ax) < \infty$, where $g(z)$ is the indicator (in Boyd's terminology) of $B_d(1)$. The advantage of using linearized ADMM rather than standard ADMM is of course that projecting onto a ball is much easier (and faster) than projecting onto an ellipsoid.
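For concreteness, a minimal NumPy sketch of this encoding (the function names are mine, and `Q` is assumed symmetric positive definite):

```python
import numpy as np

def ellipsoid_factor(Q):
    """Factor A with A' A = Q, so that x in E  <=>  ||A x||_2 <= 1.
    np.linalg.cholesky returns lower-triangular L with L @ L.T = Q,
    hence A = L.T satisfies A.T @ A = Q."""
    L = np.linalg.cholesky(Q)
    return L.T

def project_unit_ball(z):
    """Euclidean projection onto B_d(1): rescale z if it lies outside."""
    nrm = np.linalg.norm(z)
    return z if nrm <= 1.0 else z / nrm
```

Projecting onto the ball is a single norm computation and rescale, whereas projecting onto the ellipsoid directly would require an inner iterative solve.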

Now, it turns out that the rate of convergence of the algorithm depends heavily on the choice of lambda (and mu) (notation from Section 4.4.2 of the paper Proximal algorithms by Parikh and Boyd). The (long) paper on ADMM by Boyd et al. (Distributed optimization...) has a discussion on primal and dual residuals, and how the relative sizes of these can be used to adaptively tune the scaling parameter $\rho$ of ADMM (Section 3.3-3.4).
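For reference, the residual-balancing heuristic of Section 3.4.1 of that paper amounts to the following sketch (`update_rho`, `thresh`, and `tau` are my names; the values 10 and 2 are the typical choices mentioned there):

```python
def update_rho(rho, r_norm, s_norm, thresh=10.0, tau=2.0):
    """Residual-balancing heuristic from Boyd et al., Sec. 3.4.1:
    grow rho when the primal residual dominates, shrink it when the
    dual residual dominates, leave it alone otherwise."""
    if r_norm > thresh * s_norm:
        return rho * tau
    if s_norm > thresh * r_norm:
        return rho / tau
    return rho
```

Note that when $\rho$ changes in the scaled form of ADMM, the scaled dual variable must be rescaled accordingly, $u \leftarrow u\, \rho_{\mathrm{old}} / \rho_{\mathrm{new}}$.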

For linearized ADMM the primal residual is the same as for standard ADMM (I assume), but what about the dual residual? Working from Section 4.4.2 of the Proximal algorithms paper, I have come up with the expression

$$-(1/\lambda) A' (z^{k+1}-z^k) + (1/\lambda) A' A (x^{k+1}-x^k) - (1/\mu) (x^{k+1}-x^k)$$

for the dual residual (here the usual $\rho$ of ADMM is $1/\lambda$). When I use this expression in my particular example, it behaves as expected: choosing $\lambda$ well gives quick convergence and primal and dual residuals of comparable magnitude; choosing $\lambda$ poorly gives slow convergence and unbalanced residuals.
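For concreteness, here is a toy NumPy sketch of the iteration I am running with this residual. The problem data below are made-up placeholders, not my actual $f$; I use $f(x) = \|x - b\|_1$ just to have a simple prox:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_admm(A, b, lam, mu, iters=500):
    """Linearized ADMM for  min ||x - b||_1  s.t.  ||A x||_2 <= 1,
    following Sec. 4.4.2 of Parikh & Boyd (rho = 1/lam; convergence
    requires 0 < mu <= lam / ||A||_2^2)."""
    m, d = A.shape
    x, z, u = np.zeros(d), np.zeros(m), np.zeros(m)
    for _ in range(iters):
        x_old, z_old = x, z
        # x-update: prox_{mu f} evaluated at the linearized point
        v = x - (mu / lam) * (A.T @ (A @ x - z + u))
        x = b + soft_threshold(v - b, mu)
        # z-update: prox_{lam g} = Euclidean projection onto the unit ball
        w = A @ x + u
        z = w if np.linalg.norm(w) <= 1.0 else w / np.linalg.norm(w)
        # scaled dual update
        u = u + A @ x - z
        # primal residual, and the dual residual from the expression above
        r = A @ x - z
        s = (-(1.0 / lam) * (A.T @ (z - z_old))
             + (1.0 / lam) * (A.T @ (A @ (x - x_old)))
             - (1.0 / mu) * (x - x_old))
    return x, np.linalg.norm(r), np.linalg.norm(s)
```

The stopping test then compares $\|r\|$ and $\|s\|$ against tolerances, exactly as in standard ADMM.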

Is the above expression found somewhere in the literature (so I can confirm it is correct)?

Deriving it I found what appears to be a misprint in the Proximal algorithms paper, namely line 3 from bottom on p.158, where $\mu/2$ should be $1/(2\mu)$ I think. Does this make sense?
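For what it's worth, the coefficient in question traces back to the definition of the proximal operator. The $x$-update of linearized ADMM is

$$x^{k+1} = \mathbf{prox}_{\mu f}\!\left( x^k - (\mu/\lambda)\, A' (A x^k - z^k + u^k) \right),$$

and by definition

$$\mathbf{prox}_{\mu f}(v) = \arg\min_x \left( f(x) + \frac{1}{2\mu} \|x - v\|_2^2 \right),$$

so the quadratic penalty carries the coefficient $1/(2\mu)$; a $\mu/2$ in that position would be inconsistent with $\mathbf{prox}_{\mu f}$.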

/Tobias

Accepted Answer

A good way to get clear on these issues is to assimilate the relevant literature on the subject. Along this line, better primary references on ADMM (beyond the Boyd article, which is good for its own purposes) are:

  • Eckstein, J., Bertsekas, D.P.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming 55 (1992)

  • Glowinski, R., Marroco, A.: Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique 9 (1975)

  • Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications 2 (1976)

Now, it is common knowledge that the main issue with ADMM is that its "rate of convergence" depends heavily on the $\rho$ parameter (the step size for ascent in the dual objective), and in a way that is not well understood. There is no free lunch! :)

However, for special cases like ridge regression, quadratic programming, nonnegative Lasso, and recently even more general problems for which the loss function $f$ is $\mathcal{C}^2$ with Lipschitzian gradient and the penalty $g$ is so-called partly smooth, we can show local linear (i.e., exponentially fast) convergence, and we even have explicit formulae for the rate in terms of $\rho$ in some of these cases (thus there is hope that one can choose $\rho$ so as to get the best possible ADMM implementation for the given problem). See the references below.

  • Ghadimi, E., Teixeira, A., Shames, I., Johansson, M.: Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. arXiv preprint arXiv:1306.2454 (2013)

  • Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of inertial forward-backward splitting. arXiv preprint arXiv:1503.03703 (2015)

  • Liang, J., Fadili, J., Peyré, G., Luke, R.: Activity identification and local linear convergence of Douglas–Rachford/ADMM under partial smoothness. arXiv preprint arXiv:1412.6858 (2014)

  • Boley, D.: Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM Journal on Optimization 23 (2013)

  • Nishihara, R., Lessard, L., Recht, B., Packard, A., Jordan, M.I.: A general analysis of the convergence of ADMM. arXiv preprint arXiv:1502.02009 (2015)

  • Demanet, L., Zhang, X.: Eventual linear convergence of the Douglas–Rachford iteration for basis pursuit. arXiv preprint, http://arxiv.org/abs/1301.0542

Also, you might want to look at

which should be applicable to your class of problems.

I hope this is all useful.