Is Frank-Wolfe a descent algorithm?

A colleague was explaining to me that the Frank-Wolfe algorithm is a descent algorithm (i.e., its objective value decreases monotonically at each iteration). However, when I simulated it, my curve does not decrease monotonically, although it does converge. It's possible I'm just a bad coder, but can someone point me to a proof that shows Frank-Wolfe is a descent method?
Edit: I found an error in my own proof. Please ignore this.

Never mind, I managed to prove it. Since I had some difficulty finding the proof, I'll just post the answer here.
The Frank-Wolfe algorithm solves
$\min f(x)$ subject to $x \in D$
using the iteration
$x^+ = (1-\gamma) x + \gamma s^*$
for some $\gamma \in (0,1)$, and $s^* = \arg\min_{s \in D} \langle s, \nabla f(x) \rangle$.
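For concreteness, here is a minimal sketch of that iteration in Python (the names `grad_f` and `lmo` are just placeholders I made up for the gradient and the linear minimization oracle over $D$; the step size uses the common rule $\gamma_k = 2/(k+2)$, though any $\gamma \in (0,1)$ fits the setup above):

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, n_iters=100):
    """Frank-Wolfe sketch with the open-loop step size gamma_k = 2/(k+2).

    grad_f: callable returning the gradient of f at x
    lmo:    linear minimization oracle, returns argmin_{s in D} <s, g>
    """
    x = np.asarray(x0, dtype=float)
    for k in range(n_iters):
        s = lmo(grad_f(x))            # solve the linear subproblem over D
        gamma = 2.0 / (k + 2.0)       # diminishing step size
        x = (1.0 - gamma) * x + gamma * s
    return x
```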
From convexity we have
$f(x^+) \leq (1-\gamma) f(x) + \gamma f(s^*)$.
Since $s^*$ is the solution of a convex optimization problem, we have
$\nabla f(s)^T(x-s)\geq 0, \forall x\in D$.
[Edit: the optimality condition should be $\nabla f(x)^T(x-s)\geq 0, \forall x\in D$, not what's written. So this proof is incorrect.]
Finally, by the first-order condition for convexity, we have $f(s^*) - f(x) \leq \nabla f(s^*)^T(s^*-x)$.
Mix and match: taking $x$ as the test point in the inequality above gives $f(s^*) - f(x) \leq \nabla f(s^*)^T(s^*-x) \leq 0$, so $f(x^+) \leq (1-\gamma) f(x) + \gamma f(s^*) \leq f(x)$, i.e. $f(x^+) - f(x) \leq 0$.
I think it depends on how you choose $\gamma$. If you choose it using line search, then it is (see this proof); otherwise, I think in general it is not. Consider the case $f(x) = x^2$ with $D = [-1,1]$. Starting from $x_0=1/2$ and using the rule $\gamma_k = \frac{2}{k+2}$ (from the Wikipedia page you linked), we have:
$$ \begin{array}{rcccc} k: & 0 & 1 & 2 & \dots \\ x_k: & 1/2 & -1 & 1/3 & \dots \\ f(x_k): & \mathbf{1/4} & \mathbf{1} & \mathbf{1/9} & \dots \\ s\,\nabla f(x_k): & s & -2 \, s & 2/3 \, s & \dots \\ s_k: & -1 & 1 & -1 & \dots \\ \gamma_k: & 1 & 2/3 & 1/2 & \dots \\ \end{array} $$
As you can see, the value of the cost function starts at $1/4$, then goes up to $1$ before decreasing to $1/9$.
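A quick script (just a hand-rolled check of the table above, nothing more) reproduces the non-monotone behavior:

```python
# f(x) = x^2 on D = [-1, 1], x0 = 1/2, gamma_k = 2/(k+2)
x = 0.5
for k in range(4):
    grad = 2.0 * x                    # f'(x) = 2x
    s = -1.0 if grad > 0 else 1.0     # argmin_{s in [-1,1]} s * grad
    gamma = 2.0 / (k + 2.0)
    print(f"k={k}: x_k={x:+.4f}  f(x_k)={x**2:.4f}  s_k={s:+.0f}  gamma_k={gamma:.3f}")
    x = (1.0 - gamma) * x + gamma * s
# prints f(x_k) = 0.2500, 1.0000, 0.1111, ... : up, then back down
```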
Your proof seems to suggest that the algorithm is a descent algorithm for an arbitrary choice of $\gamma$. I think there is an error in the step "Since $s^*$ is the solution of a convex optimization problem, we have $\nabla f(s)^T(x-s)\geq 0, \forall x\in D$."
In fact, this would hold if the function being minimized were $f(s)$, whereas $s^*$ is actually the solution of the tangent problem "$\operatorname{minimize} \nabla f(x_k)^T s$".
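Writing out the optimality condition for that tangent subproblem makes this explicit: since $s^*$ minimizes the linear function $s \mapsto \nabla f(x_k)^T s$ over $D$,
$$\nabla f(x_k)^T (x - s^*) \geq 0 \quad \forall x \in D,$$
which is exactly the corrected condition noted in the edit to the other answer; it involves $\nabla f(x_k)$, not $\nabla f(s^*)$.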