Why Does the Projected Gradient Descent Method Work?


Consider the problem \begin{align*} \min_{x \in \mathbb{R}^n} &\quad f(x) \\ \text{s.t.} &\quad x \in C, \end{align*} where $C$ is a closed convex set. As $C$ is closed and convex, the projection onto $C$, $P_C$, is well defined for every $x \in \mathbb{R}^n$. The projected gradient method proposes to solve the above optimization problem by taking steps of the form $x_{t+1} = P_C[x_t - \eta \nabla f(x_t)]$.
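For concreteness, here is a minimal sketch of the update rule in Python/NumPy. The box constraint, the quadratic objective, and the step size are my own illustrative choices, not part of the question:

```python
import numpy as np

# A minimal sketch of projected gradient descent, assuming (for
# illustration) C = the box [0, 1]^n, whose projection is a componentwise
# clip, and f(x) = ||x - b||^2 / 2 with b lying outside C.
def project_box(x, lo=0.0, hi=1.0):
    return np.clip(x, lo, hi)

b = np.array([2.0, -0.5])              # unconstrained minimizer of f
grad_f = lambda x: x - b               # gradient of f(x) = ||x - b||^2 / 2

x = np.zeros(2)
eta = 0.1                              # step size (illustrative)
for _ in range(200):
    x = project_box(x - eta * grad_f(x))   # x_{t+1} = P_C[x_t - eta grad f(x_t)]
print(x)                               # -> [1.0, 0.0], the minimizer of f over C
```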

It is well known that for unconstrained problems the negative gradient is the direction of steepest descent, but why does the method still work after projecting? Namely, is there a result along the following lines?

"Exists $\varepsilon > 0$ so that for $0 < \delta < \varepsilon$, we have $f(P_C[x - \delta\nabla f(x)]) \le f(x)$"

As a bonus question: when there is more than one convex constraint (say $x \in C$ and $x \in D$, with both sets convex), there is an alternating projection method that consists of projecting repeatedly onto $C$ and then onto $D$. The sequence of projections converges to a point in $C\cap D$, if $C \cap D \ne \emptyset$. But that limit is not necessarily the projection of the initial point onto $C \cap D$, so why can alternating projections work with gradient descent?
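To illustrate that caveat, here is a small sketch of alternating projections. The sets $C$ (the unit ball) and $D$ (a halfspace) and the starting point are my own illustrative choices; the limit lands in $C \cap D$ but is not the projection of the starting point onto $C \cap D$:

```python
import numpy as np

# Alternating projections with illustrative choices:
# C = the unit ball, D = the halfspace {x : x[0] >= 0}.
def P_C(x):                         # project onto the unit ball
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def P_D(x):                         # project onto the halfspace x[0] >= 0
    y = x.copy()
    y[0] = max(y[0], 0.0)
    return y

x0 = np.array([-2.0, 2.0])
x = x0
for _ in range(50):
    x = P_D(P_C(x))                 # project onto C, then onto D
print(x)                            # ~ [0, 0.7071], a point of C ∩ D

# The actual projection of x0 onto C ∩ D (a half-disk) is [0, 1], so the
# alternating-projection limit is NOT P_{C∩D}(x0).
```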



BEST ANSWER

I think a rigorous proof of convergence is not so easy, but it can certainly be found in several books on the topic. I will give you a heuristic argument for the plausibility of the algorithm, focusing on the first part of your question.

Heuristic arguments

  • While going in the direction of the negative gradient $-\nabla f(x)$ leads to descent, this is not the only direction with this property. In fact, for every direction $d\in\Bbb R^n$ with $\def\<{\langle}\def\>{\rangle}\<d,\nabla f(x)\> < 0$ there is some $\epsilon > 0$ so that $$f(x+\eta d)< f(x),\qquad\text{for all $0<\eta<\epsilon$}.$$ Conversely, if $\<d,\nabla f(x)\>\ge 0$, then, to first order, this direction leads to ascent (or at least no descent).


  • I assume by projection you mean $$P_C[x] := \mathop{\mathrm{argmin}}\limits_{c\in C} d(x,c).$$ There is an insightful way to characterize this projection: every $x\in\Bbb R^n$ can be written as $x=c+v$ in a unique way, where $c\in C$ and $v\in N_C(c)$ is from the normal cone of $C$ at $c$. We then have $P_C[x]=c$. Or, to put it another way: $$P_C[x]=c\quad\Longleftrightarrow \quad x-c\in N_C(c).$$


  • We now show that if $x_{t+1}=P_C[x_t-\eta \nabla f(x_t)]$ is different from the current point $x_t$, then it lies in a direction of descent. In other words, $x_{t+1}-x_t$ is a descent direction: $$\<x_{t+1}-x_t,\nabla f(x_t)\>< 0.$$ This is evident from a picture, but a bit messy to prove. It also does not by itself show that $f(x_{t+1})<f(x_t)$, but it makes it plausible that this happens if $\eta$ is sufficiently small. (Both claims are spot-checked numerically in the sketch after this list.) So assume the opposite, that $x_{t+1}$ lies in a direction of ascent: $$\<x_{t+1}-x_t,\nabla f(x_t)\>\ge 0\qquad\implies\qquad (*)\;\; \<x_t-x_{t+1},\eta\nabla f(x_t)\>\le 0,$$ where multiplying by $\eta>0$ does not hurt the inequality. Because the unconstrained gradient descent step $\color{red}{x_t - \eta\nabla f(x_t)}$ projects to $x_{t+1}$, it can be written as $\color{blue}{x_{t+1}+v}$ for some $v\in N_C(x_{t+1})$. Since $v$ is in the normal cone, we have $$\<c-x_{t+1},v\>\le 0,\;\text{for all $c\in C$}\qquad\implies\qquad (**)\;\; \<x_t-x_{t+1},v\>\le 0,$$ where the implication follows by choosing $c=x_t\in C$. Adding $(*)$ and $(**)$ gives $$(\times)\;\; \<x_t-x_{t+1},v+\eta\nabla f(x_t)\> \le 0.$$ But the two arguments of this inner product are actually equal, as is evident from $$\color{red}{x_t-\eta\nabla f(x_t)} = \color{blue}{x_{t+1}+v}\qquad\implies\qquad x_t-x_{t+1} = v+\eta\nabla f(x_t).$$ So $(\times)$ gives us $\|x_t-x_{t+1}\|^2\le 0\implies x_t=x_{t+1}$. Thus, the only way to not make a move in a descent direction is to not move at all.


  • It remains to show that not moving only happens when no descent direction is available, i.e., when the first-order optimality conditions hold. If $x_t-\eta\nabla f(x_t)$ projects to $x_t$ again, this means \begin{align} x_t-\eta\nabla f(x_t)-x_t\in N_C(x_t) \quad &\implies\quad -\nabla f(x_t)\in N_C(x_t) \\ &\implies\quad \<c-x_t,\nabla f(x_t)\> \ge 0,\; \text{for all $c\in C$}, \end{align} where the first implication uses that the normal cone is closed under positive scaling (divide by $\eta>0$). The last line means that moving in the direction of any point $c\in C$ (and in a convex set these are all the available directions) leads, to first order, to ascent. Hence $x_t$ satisfies the first-order conditions for a local minimum of $f$ over $C$ (and if $f$ is convex, it is a global minimizer over $C$).
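For a concrete sanity check of the last two bullets, here is a short numerical sketch. The objective, the set $C$ (the unit ball), the step size, and the tolerances are all my own illustrative choices:

```python
import numpy as np

# Numerical check of the bullets above, with illustrative choices:
# f(x) = ||x - b||^2 / 2 (so grad f(x) = x - b) and C = the unit ball.
def P_C(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

b = np.array([3.0, 1.0])            # unconstrained minimizer, lies outside C
grad = lambda x: x - b
eta = 0.2

x = np.array([-0.5, 0.8])
for _ in range(100):
    x_next = P_C(x - eta * grad(x))
    if np.linalg.norm(x_next - x) < 1e-7:
        break                       # (approximate) fixed point reached
    # Third bullet: a nonzero step is always a descent direction.
    assert np.dot(x_next - x, grad(x)) < 0
    x = x_next

# Fourth bullet: at the fixed point, <c - x, grad f(x)> >= 0 for all c in C.
# Spot-check with random points of C (tolerance allows for finite precision).
rng = np.random.default_rng(0)
for c in rng.uniform(-1, 1, size=(1000, 2)):
    if np.linalg.norm(c) <= 1.0:
        assert np.dot(c - x, grad(x)) >= -1e-5
print(x)                            # ~ b/||b||, the minimizer of f over C
```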
ANSWER

There is a formal way to prove it; you can find the full proof in many convex optimization course notes.

My intuition for why it works is that projected gradient descent is itself a form of alternating projection: each iteration alternates a gradient step with a projection onto $C$.
This becomes clear when you view it through the proximal gradient method (framework), where the projection $P_C$ is exactly the proximal operator of the indicator function of $C$.
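To make that connection concrete, here is a sketch in the same illustrative setting as above ($f(x)=\|x-b\|^2/2$ and $C$ the unit ball, my own choices). The proximal operator of the indicator function of $C$ reduces to $P_C$, so each proximal gradient step is literally a gradient step alternated with a projection:

```python
import numpy as np

# prox_{eta*g}(y) = argmin_x  g(x) + ||x - y||^2 / (2*eta).  When g is the
# indicator function of C (0 on C, +inf outside), this argmin is exactly
# the projection P_C(y), regardless of eta.  Illustrative choice: C = unit ball.
def prox_indicator_ball(y):
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

b = np.array([3.0, 1.0])               # f(x) = ||x - b||^2 / 2, as above
grad_f = lambda x: x - b
eta = 0.2

x = np.zeros(2)
for _ in range(100):
    # Proximal gradient step = gradient step, then "projection step":
    x = prox_indicator_ball(x - eta * grad_f(x))
print(x)                               # -> ~ b/||b||, same fixed point as PGD
```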