Proximal Operator for the Logistic Loss Function


I am reading the ADMM paper by S. Boyd et al., Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.

I'm interested in implementing an L1-regularized, feature-wise distributed multinomial logistic regression solver using ADMM, so I read page 70, section 8.3.3, of Boyd's paper with great interest. However, I am a bit confused.

Boyd gives the z update as follows:

$$ \bar{z}^{k+1} := \underset{\bar{z}}{\text{argmin}}\left(l(N\bar{z}) + (\rho / 2) ||\bar{z} - \overline{Ax}^{k+1} - u^k||_{2}^{2}\right) $$

... where $ l \left( \cdot \right) $ is the logistic loss function. I understand that this is a proximal operator (thanks to Derivation of Soft Thresholding Operator / Proximal Operator of L1 Norm), but:

1) I'm not sure how to solve this expression. Does anyone know the logistic proximity operator or how to efficiently derive it?

2) Boyd points out that the logistic proximity operator "can be very efficiently computed by a lookup table that gives the approximate value, followed by one or two Newton steps (for a scalar problem)." Does anyone have any insight into how to do this?
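
For concreteness, here is a minimal sketch (in Python, my own illustration rather than anything from the paper) of what I understand that strategy to be for a scalar problem of the form $ \min_z \lambda \ln \left( 1 + {e}^{-z} \right) + \frac{\rho}{2} {\left( z - v \right)}^{2} $: start from a cheap initial guess (Boyd suggests a lookup table) and refine it with one or two Newton steps. The function name `scalar_logistic_prox` and the crude warm start are my own assumptions.

```python
import numpy as np

def scalar_logistic_prox(v, lam, rho, newton_steps=2):
    # Solve min_z lam*log(1 + exp(-z)) + (rho/2)*(z - v)^2 with a few Newton steps.
    z = v  # crude warm start; a lookup table indexed by v could give a better one
    for _ in range(newton_steps):
        s = 1.0 / (1.0 + np.exp(-z))                # sigmoid(z)
        grad = lam * (s - 1.0) + rho * (z - v)      # first derivative of the objective
        hess = lam * s * (1.0 - s) + rho            # second derivative, always positive
        z -= grad / hess                            # Newton update
    return z

print(scalar_logistic_prox(v=0.3, lam=1.0, rho=2.0))
```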

Thank you!

There are 2 answers below.

Answer 1

I am assuming that in your case $\ell(x) = \ln(1 + \exp(-c^T x))$ for some vector $c$. I am not aware of any closed-form expression for the optimal $\bar{z}^{k+1}$, but it can easily be computed via Newton's method. You should read Boyd's book "Convex Optimization", where the method is described.
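
As an illustration of this suggestion, here is a minimal sketch in Python (my own, assuming $ \ell \left( x \right) = \ln \left( 1 + \exp \left( -{c}^{T} x \right) \right) $ and the prox objective $ \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda \ell \left( x \right) $). Because the Hessian is the identity plus a rank-one term, each Newton step can be computed with the Sherman-Morrison formula rather than a matrix factorization; the function name `logistic_prox` is hypothetical.

```python
import numpy as np

def logistic_prox(y, c, lam, iters=20, tol=1e-10):
    # prox_{lam*ell}(y) with ell(x) = log(1 + exp(-c^T x)), via Newton's method.
    x = y.copy()                                     # warm start at the prox center
    for _ in range(iters):
        t = c @ x
        sig = 1.0 / (1.0 + np.exp(-t))               # sigmoid(c^T x)
        grad = x - y + lam * (sig - 1.0) * c         # gradient of the prox objective
        if np.linalg.norm(grad) < tol:
            break
        a = lam * sig * (1.0 - sig)                  # curvature along the direction c
        # Newton direction d solves (I + a*c*c^T) d = -grad; use Sherman-Morrison.
        d = -grad + (a * (c @ grad) / (1.0 + a * (c @ c))) * c
        x += d
    return x

y = np.array([0.5, -1.0, 2.0])
c = np.array([1.0, 0.3, -0.7])
print(logistic_prox(y, c, lam=0.8))
```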

Answer 2

The logistic loss function is given by:

$$ \ell \left( x \right) = \ln \left( 1 + {e}^{-{c}^{T} x} \right), \, x \in \mathbb{R}^{n} $$

So the Prox Operator is given by:

$$ \operatorname{Prox}_{\lambda \ell \left( \cdot \right)} \left( y \right) = \arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda \ln \left( 1 + {e}^{-{c}^{T} x} \right) = \arg \min_{x} f \left( x \right) $$

The above is a smooth convex function (strongly convex, because of the quadratic term), hence its unique stationary point is the minimizer.
Looking at its gradient yields:

$$\begin{aligned} \frac{\partial f \left( x \right) }{\partial x} & = x - y + \lambda \frac{\partial \ln \left( 1 + {e}^{-{c}^{T} x} \right) }{\partial x} \\ & = x - y - \lambda \frac{ {e}^{-{c}^{T} x} }{1 + {e}^{-{c}^{T} x}} c \end{aligned}$$

There is no closed-form expression for the point where the gradient vanishes.
As @AlexShtof suggested, you could use Newton's method to solve for it.

Yet since the gradient has the nice form $ x - \left( y + \lambda \frac{ {e}^{-{c}^{T} x} }{1 + {e}^{-{c}^{T} x}} c \right) = x - g \left( x \right) $, we could use a related yet simpler method: Fixed Point Iteration.

The idea is to iterate $ {x}^{n + 1} = g \left( {x}^{n} \right) $. If the iteration converges, the sequence tends to a point $ {x}_{0} $ with $ g \left( {x}_{0} \right) = {x}_{0} $, namely a fixed point of $ g \left( \cdot \right) $. This point is exactly where the gradient of our objective vanishes, namely the solution of the Prox Operator.
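
A minimal sketch of this Fixed Point Iteration in Python (my own illustration; the repository code referenced at the end of this answer is MATLAB), with $ g \left( x \right) = y + \lambda \frac{ {e}^{-{c}^{T} x} }{1 + {e}^{-{c}^{T} x}} c $:

```python
import numpy as np

def logistic_prox_fixed_point(y, c, lam, iters=500, tol=1e-10):
    # Iterate x <- g(x) = y + lam * (exp(-c^T x) / (1 + exp(-c^T x))) * c.
    x = y.copy()
    for _ in range(iters):
        t = c @ x
        x_new = y + lam * (1.0 / (1.0 + np.exp(t))) * c   # exp(-t)/(1+exp(-t)) = 1/(1+exp(t))
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

y = np.array([0.5, -1.0, 2.0])
c = np.array([1.0, 0.3, -0.7])
print(logistic_prox_fixed_point(y, c, lam=0.8))   # should agree with a Newton-based solver
```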

Convergence Analysis

Applying the Fixed Point Iteration to $ g \left( x \right) $ can be seen as applying the Gradient Descent method to $ f \left( x \right) $:

$$ {x}^{k + 1} = g \left( {x}^{k} \right) = {x}^{k} - \left( {x}^{k} - g \left( {x}^{k} \right) \right) = {x}^{k} - \nabla f \left( {x}^{k} \right) $$

As can be seen above, this is Gradient Descent with a constant step size of $ 1 $.
For a function $ f \left( x \right) $ whose gradient is $ L $ Lipschitz continuous this means:

  1. The Hessian of the function obeys $ {H}_{f} \left( x \right) \preceq L I $.
  2. Gradient Descent converges for any step size $ \alpha \leq \frac{2}{L} $.

From the above we need $ 1 \leq \frac{2}{L} $, that is $ L \leq 2 $. Since $ {H}_{f} \left( x \right) = I - \nabla g \left( x \right) $, the condition $ {H}_{f} \left( x \right) \preceq 2 I $ is equivalent to $ \nabla g \left( x \right) \succeq -I $.

In our case $ \nabla g \left( x \right) = -\lambda \frac{ {e}^{-{c}^{T} x} }{ {\left( 1 + {e}^{-{c}^{T} x} \right)}^{2} } c {c}^{T} $, which is a rank one matrix whose only nonzero eigenvalue is $ -\lambda \frac{ {e}^{-{c}^{T} x} }{ {\left( 1 + {e}^{-{c}^{T} x} \right)}^{2} } {c}^{T} c $. The requirement $ \nabla g \left( x \right) \succeq -I $ therefore becomes:

$$ \lambda \frac{ {e}^{-{c}^{T} x} }{ {\left( 1 + {e}^{-{c}^{T} x} \right)}^{2} } {c}^{T} c \leq 1, \quad \forall x \in \mathbb{R}^{n} $$

Since the scalar factor is bounded, $ \frac{ {e}^{-{c}^{T} x} }{ {\left( 1 + {e}^{-{c}^{T} x} \right)}^{2} } \in \left( 0, 0.25 \right] $ for all $ x \in \mathbb{R}^{n} $, a sufficient condition for this is $ \lambda {c}^{T} c \leq 4 $.
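
As a small illustration (my own sketch, following the analysis above), one can check this sufficient condition before running the unit-step iteration, and otherwise fall back to the conservative step size $ \frac{1}{L} $ with $ L = 1 + \lambda {c}^{T} c / 4 $, which upper bounds the largest eigenvalue of the Hessian:

```python
import numpy as np

def safe_step_size(c, lam):
    # L = 1 + lam * c^T c / 4 upper bounds the largest Hessian eigenvalue of f.
    L = 1.0 + lam * (c @ c) / 4.0
    if lam * (c @ c) <= 4.0:        # equivalently L <= 2: the unit step is covered
        return 1.0
    return 1.0 / L                  # conservative step size for which Gradient Descent converges

c = np.array([3.0, 1.0, -2.0])
print(safe_step_size(c, lam=2.0))   # lam * c^T c = 28 > 4, so a damped step is returned
```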

This led me to an interesting observation: one could minimize $ f \left( x \right) $ using Newton's method with guaranteed convergence, yet applying the Fixed Point Iteration to its gradient for root finding is not guaranteed to converge. I analyzed the convergence in more depth in my answer to Fixed Point Iteration for the Gradient of Smooth Convex Function.

A MATLAB implementation, including validation using CVX, can be found in my StackExchange Mathematics Q1683654 GitHub Repository.
The MATLAB code also includes the CVX code for reference and a Gradient Descent based solver.