Nonlinear optimization with constraints


Let $w$ be some vector. I want to solve: $$ w=\arg\min_w \sum_i f_i^2(w) \\ \text{s.t.} \quad w\geq 0, \ \sum_i w_i = 1$$ That can be formulated as follows: $$ w=\arg\min_w \|{\bf f}(w)\|^2 \\ \text{s.t.} \quad H(w)\geq 0, \ G(w)= 0 $$ where $H(w)=w$ and $G(w)={\bf 1}^T w-1=\sum_i w_i-1$.

This seems to me a standard problem. The unconstrained problem can be solved using the Gauss-Newton (GN) method. However, I can't find how to solve the constrained problem. Which method should I use? Can GN be adapted to these constraints?

Thank you.

EDIT:

a. I will add a formula for $f_i(w)$: $$ f_i(w) = w^T A_i w - z_i$$ where $A_i$ is a symmetric matrix and $z_i$ are known scalars.
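For what it's worth, the Jacobian that GN needs follows directly from this formula: since $A_i$ is symmetric, $\nabla f_i(w) = 2 A_i w$. Here is a small sketch checking that gradient against finite differences (the random $A_i$, $z_i$, and $w$ are assumptions purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2        # a symmetric A_i, random here purely for illustration
z = 1.0
w = rng.standard_normal(n)

def f(w):
    # residual f_i(w) = w^T A_i w - z_i
    return w @ A @ w - z

def grad_f(w):
    # for symmetric A_i, the gradient of w^T A_i w is 2 A_i w,
    # so the i-th row of the Gauss-Newton Jacobian is (2 A_i w)^T
    return 2.0 * A @ w

# finite-difference check of the analytic gradient
eps = 1e-6
num = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(num, grad_f(w), atol=1e-5)
```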

b. While searching for an answer, I encountered the Levenberg–Marquardt algorithm, which appears almost suitable here, except that in my problem the quantity that varies with $i$ is a matrix $A_i$, not a scalar.

c. I thought of doing the following: since each iteration of the Gauss-Newton method solves a linear least-squares problem, I would instead solve a constrained linear least-squares problem with my constraints at each iteration. Is that the right way to go?

Thanks again.


BEST ANSWER

Your optimization problem has the form \begin{align} \tag{$\spadesuit$}\operatorname{minimize}_w &\quad F(w) \\ \text{subject to} & \quad w \in C \end{align} where $$ F(w) = \sum_i f_i(w)^2 $$ and $C$ is the probability simplex (which is a closed, convex set).

You can solve problems of the form ($\spadesuit$) using the projected gradient method: $$ w^{k+1} = \Pi_C(w^k - t\nabla F(w^k)). $$ Here $\Pi_C(y)$ is the projection of $y$ onto $C$. Projecting onto the probability simplex is a very well studied operation and can be performed very efficiently with existing algorithms. (It takes only a few lines of Matlab. See algorithm 1 here, for example: https://arxiv.org/abs/1309.1541 .)
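Following that reference, here is a short sketch of the standard sort-based simplex projection (the algorithm is from the cited paper; this particular implementation is only a sketch, not code from the paper):

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {w : w >= 0, sum(w) = 1}.

    Sort-based method (algorithm 1 of arXiv:1309.1541), O(n log n).
    """
    u = np.sort(y)[::-1]                      # sort entries in descending order
    css = np.cumsum(u)
    # largest index rho with u_rho + (1 - cumsum_rho)/rho > 0 (0-indexed here)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
    lam = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(y + lam, 0.0)
```

The output is always nonnegative and sums to one, and a point already in the simplex is returned unchanged.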

The projected gradient method has a step size restriction, so be sure to select the step size $t$ small enough that the restriction is satisfied. There is also a line search procedure you can use to select step sizes (which I think is probably a good idea).

You could also try using an accelerated version of the projected gradient method, such as FISTA, although I'm not sure how they will perform for nonconvex problems.

Edit: The projected gradient method is a special case of the proximal gradient method, and a line search version of the proximal gradient method is described in equations 1.26 and 1.27 (p. 18) of the paper "Gradient-Based Algorithms with Applications to Signal Recovery Problems" by Beck and Teboulle. This line search version of the proximal gradient method is also described, with slightly different notation, on slide 6-20 of Vandenberghe's 236c notes: http://www.seas.ucla.edu/~vandenbe/236C/lectures/proxgrad.pdf.

For convenience, I'll write the proximal gradient method with line search in detail, using notation that I think is a little more clear than the notation in the above references. This algorithm minimizes $f(x) + g(x)$, where $f$ is assumed to be differentiable (with a Lipschitz continuous gradient) and $g$ is a convex, lower semicontinuous function. The proximal operator of $g$ with parameter $t > 0$ is denoted by $\operatorname{prox}_{tg}$. If you don't know what the proximal operator is, that's fine, just replace $\operatorname{prox}_{tg}$ with $\Pi_C$, and you have the projected gradient method. Here is the algorithm: \begin{align} &\text{Initialize } x_0 \text{ and select } t_0 > 0, 0 < \alpha < 1, \beta > 1\\ &\textbf{for } k = 1,2,\ldots \textbf{do:}\\ & \quad t := \beta \, t_{k-1} \\ &\quad \textbf{repeat:} \\ &\qquad \quad x := \operatorname{prox}_{tg}(x_{k-1} - t \nabla f(x_{k-1})) \\ &\qquad \quad \textbf{break if } f(x) \leq f(x_{k-1}) + \langle \nabla f(x_{k-1}),x - x_{k-1} \rangle + \frac{1}{2t} \| x - x_{k-1} \|_2^2 \\ &\qquad \quad t := \alpha t \\ & \quad t_k := t \\ & \quad x_k := x\\ &\textbf{end for} \end{align}

I usually select $\alpha = .5$ and $\beta = 1.25$. I think $t_0 = 1$ is usually not a bad choice. You could initialize $x_0$ to be your best available guess of the optimal value of $x$, or perhaps just initialize $x_0$ randomly or to all zeros.

Here is some Python code for the proximal gradient method with line search:

import numpy as np

def proxGrad(gradf, proxg, evalf, evalg, x0, params):
    # Minimize f(x) + g(x), where f is smooth and g has an easy prox-operator.
    # gradf(x) returns (grad f(x), f(x)); proxg(y, t) evaluates prox_{tg}(y).

    maxIter = params['maxIter']
    t = params['stepSize']          # initial step size
    showTrigger = params['showTrigger']
    reduceFactor = .5               # alpha: shrink t when the line search fails
    increaseFactor = 1.25           # beta: grow t at the start of each iteration
    costs = np.zeros(maxIter)

    xk = x0
    for k in range(maxIter):

        (gradf_xk, fxk) = gradf(xk)

        gxk = evalg(xk)
        costs[k] = fxk + gxk
        if k % showTrigger == 0:
            print("Iteration: " + str(k) + "    cost: " + str(costs[k]))

        t = t * increaseFactor
        acceptFlag = False
        while not acceptFlag:
            xkp1 = proxg(xk - t * gradf_xk, t)
            fxkp1 = evalf(xkp1)
            diff = xkp1 - xk
            if fxkp1 <= fxk + np.vdot(gradf_xk, diff) + (.5 / t) * np.sum(diff**2):
                acceptFlag = True
            else:
                t = t * reduceFactor
                print("Reducing t. t is now: " + str(t))

        xk = xkp1

    return (xk, costs)
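As a self-contained sketch of how the pieces fit together on the asker's problem (random symmetric $A_i$ and consistent $z_i$ are assumptions for illustration; the simplex projection and line search are inlined so the snippet runs on its own):

```python
import numpy as np

def project_simplex(y):
    # Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based)
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(y) + 1) > 0)[0][-1]
    return np.maximum(y + (1.0 - css[rho]) / (rho + 1), 0.0)

rng = np.random.default_rng(1)
n, m = 5, 20
A = []
for _ in range(m):
    M = rng.standard_normal((n, n))
    A.append((M + M.T) / 2)                           # symmetric A_i
w_true = project_simplex(rng.standard_normal(n))      # a feasible "ground truth"
z = np.array([w_true @ Ai @ w_true for Ai in A])      # consistent data

def F(w):      # F(w) = sum_i (w^T A_i w - z_i)^2
    return sum((w @ Ai @ w - zi) ** 2 for Ai, zi in zip(A, z))

def gradF(w):  # A_i symmetric, so grad of (w^T A_i w - z_i)^2 is 4 f_i(w) A_i w
    return sum(4.0 * (w @ Ai @ w - zi) * (Ai @ w) for Ai, zi in zip(A, z))

w = np.full(n, 1.0 / n)   # start at the center of the simplex
for _ in range(500):
    g = gradF(w)
    t = 1.0               # restart the step size each iteration (simplest variant)
    while True:
        w_new = project_simplex(w - t * g)
        d = w_new - w
        if F(w_new) <= F(w) + g @ d + (0.5 / t) * (d @ d) or t < 1e-12:
            break
        t *= 0.5
    w = w_new
# w is feasible and F has decreased monotonically; since the problem is
# nonconvex, only a stationary point is guaranteed, not the global minimum
```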

Here's a little more info about the relationship between the proximal gradient method and the projected gradient method. The projected gradient method for minimizing $f(x)$ subject to the constraint that $x \in C$, is just the special case of the proximal gradient method where $g$ is the convex indicator function of $C$: $$ g(x) = \begin{cases} 0 & \quad \text{if } x \in C \\ \infty & \quad \text{otherwise.} \end{cases} $$ With this choice of $g$, the proximal operator of $g$ is given by the formula $$ \operatorname{prox}_{tg}(x) = \Pi_C(x) = \text{projection of $x$ onto $C$}. $$ (I'm assuming that $C$ is a closed convex set.)

ANOTHER ANSWER

As noted in the comments, this answer does not give an exact solution.

Using extra variables $t_i$, which model the residuals $f_i(w)$, the problem can be represented as a quadratically constrained quadratic program (QCQP): $$ \begin{align*} \min \quad & \sum_i t_i^2 \\ \mathrm{s.t.} \quad & t_i \geq w^T A_i w - z_i \quad \forall i \\ & \sum_i w_i = 1 \\ & w \geq 0 \end{align*} $$ You may wonder why the first constraint uses "$\geq$" instead of "$=$". The reason is that the inequality preserves convexity (given two feasible solutions, the line segment between them is also feasible, provided each $A_i$ is positive semidefinite). However, as noted in the comments, this relaxed formulation is not equivalent to the original problem.

The advantage of writing this problem as a QCQP is that there is a plethora of algorithms to solve it. I will focus on interior point methods, which find the (globally) optimal solution of a convex QCQP, typically in a few tens of iterations, largely independent of the size of your problem. Note that Matlab's quadprog handles only linear constraints, so it does not apply directly here; commercial solvers such as CPLEX, Mosek, and Gurobi do support quadratic constraints and have bindings for many languages, including Matlab and Python. A QCQP with positive semidefinite $A_i$ is a special case of convex optimization, and in Python you could also use scipy.optimize.
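To make the scipy.optimize suggestion concrete, here is a hedged sketch that skips the QCQP reformulation and hands the original constrained least-squares problem directly to `scipy.optimize.minimize` with the SLSQP solver (the random $A_i$ and $z_i$ are assumptions for illustration; SLSQP finds a local, not necessarily global, minimum of this nonconvex problem):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 5, 20
A = []
for _ in range(m):
    M = rng.standard_normal((n, n))
    A.append((M + M.T) / 2)        # symmetric A_i
z = rng.standard_normal(m)

def F(w):
    # F(w) = sum_i (w^T A_i w - z_i)^2
    return sum((w @ Ai @ w - zi) ** 2 for Ai, zi in zip(A, z))

res = minimize(
    F,
    x0=np.full(n, 1.0 / n),        # start at the center of the simplex
    method="SLSQP",
    bounds=[(0.0, None)] * n,      # w >= 0
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum w = 1
)
w = res.x                          # feasible local minimizer
```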