Nature of the Hessian of the dual function?


I originally posted this over at MathOverflow but it did not receive much (...any) attention. I'm hoping someone can point me in the right direction over here.

Consider a nonlinear optimization problem of the form \begin{align} \min_{{\bf x}}&\quad f({\bf x})\\ \nonumber \text{subject to } \quad&{\bf h}({\bf x}) = {\bf 0}\\ \nonumber \quad&{\bf x}\in {\bf X} \end{align} where ${\bf X}$ is a nonempty compact, convex subset of $\mathbb{R}^n$. Assume that $f$ and ${\bf h}$ are twice-continuously differentiable. Form the partial Lagrangian of the above problem as \begin{align} L({\bf x},\lambda) = f({\bf x}) + \lambda^T{\bf h}({\bf x}) \end{align} Let the dual function be formed as \begin{align} \phi(\lambda) = \inf_{{\bf x}\in{\bf X}}L({\bf x},\lambda) \end{align}

The overall goal is to solve the dual problem, $\max_{\lambda}\phi(\lambda)$, through an iterative method (for example, dual decomposition) where the dual variables are updated via Newton's method. This requires computing the gradient and Hessian of the dual function.
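As a concrete sanity check (a toy instance of my own construction, not part of the question): take $f({\bf x})=\tfrac12\|{\bf x}\|^2$, ${\bf h}({\bf x})=A{\bf x}-{\bf b}$, and a box ${\bf X}$ large enough to remain inactive, so the inner minimization has a closed form. The dual Newton iteration then looks like this:

```python
import numpy as np

# Toy instance (my own assumption, not from the post):
# f(x) = 0.5*||x||^2,  h(x) = A x - b,  X = [-10, 10]^n (large enough to stay inactive).
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))   # m=2 constraints, n=4 variables, full row rank a.s.
b = rng.standard_normal(2)

def x_of_lambda(lam):
    # Minimizer of L(x, lam) = 0.5 x.x + lam^T (A x - b) over the box;
    # the unconstrained minimizer is -A^T lam, then projected onto X coordinatewise.
    return np.clip(-A.T @ lam, -10.0, 10.0)

def dual_grad(lam):
    # By the gradient theorem below: grad phi(lam) = h(x(lam)).
    return A @ x_of_lambda(lam) - b

def dual_hess(lam):
    # Interior case: hess phi = -A [hess_xx L]^{-1} A^T = -A A^T here,
    # since hess_xx L = I for this quadratic f.
    return -(A @ A.T)

lam = np.zeros(2)
for _ in range(5):
    lam = lam - np.linalg.solve(dual_hess(lam), dual_grad(lam))

print(np.linalg.norm(dual_grad(lam)))  # ~0: the dual is quadratic, so one step suffices
```

Since $\phi$ is an exact quadratic here (the box never binds), Newton terminates at the dual optimum after a single step.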

Gradient of the Dual Function

Define the set ${\bf Y}(\lambda) = \{{\bf x}^*:{\bf x}^* \text{ minimizes }L({\bf x},\lambda)\text{ over }{\bf x}\in {\bf X} \}$. If the assumptions of the following theorem hold, the gradient of the dual function has a simple form.

Gradient of dual function (Thm 6.3.3 in Bazaraa, Sherali, Shetty, Nonlinear Programming: Theory and Algorithms). Let $f({\bf x})$ and ${\bf h}({\bf x})$ be continuous functions, and let ${\bf X}$ be a nonempty compact set. If ${\bf Y}(\bar\lambda)$ reduces to a single element ${\bf x}^*$ at the point $\bar\lambda$, then the dual function $\phi(\lambda)$ is differentiable at $\bar\lambda$ with gradient \begin{align} \nabla\phi(\bar\lambda) = {\bf h}({\bf x}^*) \end{align}
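This formula is easy to check by finite differences on a toy box-constrained quadratic (my own construction; the clipping step below is the exact minimizer because $L$ is separable and strongly convex in ${\bf x}$, so ${\bf Y}(\lambda)$ is always a singleton):

```python
import numpy as np

# Numerical check of grad phi(lam) = h(x(lam)) on a toy problem (my own construction):
# f(x) = 0.5||x||^2, h(x) = Ax - b, X = [-1, 1]^n.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
b = rng.standard_normal(2)

def x_of_lambda(lam):
    # L is separable and strongly convex in x, so the box-constrained minimizer
    # is the unconstrained one, -A^T lam, clipped coordinatewise to X.
    return np.clip(-A.T @ lam, -1.0, 1.0)

def phi(lam):
    x = x_of_lambda(lam)
    return 0.5 * x @ x + lam @ (A @ x - b)

lam = np.array([0.3, -0.2])
h_val = A @ x_of_lambda(lam) - b           # the claimed gradient
eps = 1e-6
fd = np.array([(phi(lam + eps * e) - phi(lam - eps * e)) / (2 * eps)
               for e in np.eye(2)])        # central finite differences
print(np.max(np.abs(fd - h_val)))          # agreement to finite-difference accuracy
```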

Hessian of the Dual Function

I am interested in the nature of the Hessian of the dual function (in particular, its rank and continuity). Let us maintain the assumptions of the above theorem. Introducing some notation, denote \begin{align} {\bf x}(\lambda) = \arg \min_{{\bf x}\in{\bf X}}L({\bf x},\lambda) \end{align} Thus the gradient of the dual function at any point $\lambda$ is $\nabla \phi(\lambda) = {\bf h}({\bf x}(\lambda))$. Differentiating with respect to $\lambda$ gives an expression for the Hessian, \begin{align} \nabla^2\phi(\lambda) = \nabla {\bf h}({\bf x}(\lambda))\nabla {\bf x}(\lambda) \end{align} Let us also assume that the matrix $\nabla {\bf h}({\bf x}(\lambda))$ has full rank (i.e., the linear independence constraint qualification, LICQ, holds).

ASIDE: In the absence of the ${\bf x}\in{\bf X}$ constraints, implicit differentiation of the stationarity condition $\nabla_{\bf x} L({\bf x}(\lambda),\lambda) = {\bf 0}$ gives $\nabla {\bf x}(\lambda) = -[\nabla_{{\bf x}{\bf x}}^2 L({\bf x}(\lambda),\lambda)]^{-1}\nabla {\bf h}({\bf x}(\lambda))^T$ (see Luenberger, Linear and Nonlinear Programming); the minus sign makes the resulting dual Hessian negative semidefinite, consistent with the concavity of $\phi$.
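The sensitivity formula in the aside can be verified numerically in the interior case. The instance below (quadratic $f$, linear ${\bf h}$, no binding bounds, so $\nabla^2_{{\bf x}{\bf x}}L = Q$ is constant) is my own construction:

```python
import numpy as np

# Check of grad x(lam) = -[hess_xx L]^{-1} grad h^T in the interior case,
# on a toy quadratic (my own construction): f(x) = 0.5 x^T Q x, h(x) = A x - b.
rng = np.random.default_rng(2)
n, m = 4, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)               # positive definite; hess_xx L = Q here
A = rng.standard_normal((m, n))

def x_of_lambda(lam):
    # Stationarity: Q x + A^T lam = 0  ->  x(lam) = -Q^{-1} A^T lam
    return np.linalg.solve(Q, -A.T @ lam)

lam = rng.standard_normal(m)
analytic = -np.linalg.solve(Q, A.T)       # n x m Jacobian of x(lam), note the minus sign
eps = 1e-6
fd = np.column_stack([(x_of_lambda(lam + eps * e) - x_of_lambda(lam - eps * e)) / (2 * eps)
                      for e in np.eye(m)])
print(np.max(np.abs(fd - analytic)))      # agreement to roundoff (x is linear in lam)
```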

Now, it seems to me that, due to the ${\bf x}\in {\bf X}$ constraints, the sensitivity of the optimizer with respect to $\lambda$, denoted $\nabla {\bf x}(\lambda)$, would have discontinuous jumps as primal variables bind on the boundary of ${\bf X}$ (or leave it and become free). Denote the regions of the primal space where the variables indexed by $\mathcal{I}$ bind by \begin{align} \mathcal{P}_{\bf X}^{\mathcal{I}}=\{{\bf x}: x_i \text{ lies on the boundary of }{\bf X}\text{ for each } i\in\mathcal{I}\} \end{align} Corresponding to each such region in the primal space, there is a region of dual variables $\mathcal{D}^{\mathcal{I}}_\Lambda$ defined as \begin{align} \mathcal{D}_\Lambda^{\mathcal{I}}=\{\lambda: x_i(\lambda) \text{ lies on the boundary of }{\bf X}\text{ for each } i\in\mathcal{I}\} \end{align} The boundaries between the regions $\mathcal{D}^{\mathcal{I}}_\Lambda$ appear to be where the discontinuities of the Hessian occur.
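A one-dimensional example (my own construction) makes the jump explicit. For $f(x)=\tfrac12 x^2$, $h(x)=x-b$, ${\bf X}=[-1,1]$, we get $x(\lambda)=\operatorname{clip}(-\lambda,-1,1)$, so $\phi'(\lambda)=x(\lambda)-b$ and $\phi''(\lambda)=-1$ on the interior region $|\lambda|<1$ but $\phi''(\lambda)=0$ once the bound binds, $|\lambda|>1$:

```python
import numpy as np

# 1-D illustration (my own construction) of the Hessian jump across region boundaries:
# f(x) = 0.5 x^2, h(x) = x - b, X = [-1, 1], so x(lam) = clip(-lam, -1, 1).
b = 0.5

def dual_grad(lam):
    # grad phi(lam) = h(x(lam)) = clip(-lam, -1, 1) - b
    return np.clip(-lam, -1.0, 1.0) - b

eps = 1e-6
def dual_hess(lam):
    # Finite-difference second derivative of phi
    return (dual_grad(lam + eps) - dual_grad(lam - eps)) / (2 * eps)

print(dual_hess(0.5), dual_hess(2.0))     # -1 in the interior region, 0 once x binds
```

So within each region $\mathcal{D}^{\mathcal{I}}_\Lambda$ the Hessian is constant here, and it jumps from $-1$ to $0$ (losing rank) at the boundary $|\lambda|=1$.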

Question: Does the Hessian $\nabla^2\phi(\lambda)$ possess any structure (e.g. negative definiteness) within each $\mathcal{D}_\Lambda^{\mathcal{I}}$? What is the standard machinery for dealing with the ${\bf x}\in{\bf X}$ constraints (are semismooth/generalized Newton's methods of any use)?


Update Question: The end goal of computing the Hessian is to use Newton's method to update the dual variables (in a dual decomposition method). To this end, I have been running simulations on various problems, and it seems that simply using the Hessian of the system without the ${\bf x}\in {\bf X}$ constraints, that is, \begin{align} \nabla^2\phi(\lambda) = -\nabla {\bf h}({\bf x}(\lambda))[\nabla_{{\bf x}{\bf x}}^2 L({\bf x}(\lambda),\lambda)]^{-1}\nabla {\bf h}({\bf x}(\lambda))^T \end{align} performs quite well. Does anyone have any idea why this is the case? Are there provable robustness guarantees for Newton's method that the system is automatically satisfying? Perhaps the positive numerical results will provide some insight into the original question.
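For what it's worth, here is a minimal 1-D reproduction (my own construction) of that observation: the bound $x=1$ is active at the solution, yet Newton with the interior-case Hessian still converges, and in fact in one step here:

```python
import numpy as np

# 1-D sketch (my own construction): f(x) = 0.5 x^2, h(x) = x - 1, X = [-1, 1],
# so the bound x = 1 is active at the solution. The Hessian of the system
# WITHOUT the x in X constraint is -[hess_xx L]^{-1} = -1 (constant).
def x_of_lambda(lam):
    return float(np.clip(-lam, -1.0, 1.0))

def phi(lam):
    x = x_of_lambda(lam)
    return 0.5 * x * x + lam * (x - 1.0)

lam = 0.0
for _ in range(10):
    grad = x_of_lambda(lam) - 1.0          # grad phi(lam) = h(x(lam))
    lam = lam - grad / (-1.0)              # Newton step with the interior-case Hessian
print(lam, phi(lam))                       # lands on lam = -1, phi = 0.5 (the dual optimum)
```

Here the dual optimum is the whole ray $\lambda \le -1$ with value $\phi = 0.5$, and the interior-case Newton step reaches its boundary point $\lambda = -1$ immediately; whether this behavior generalizes is exactly what I am asking about.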