We've done a bunch of theoretical stuff in my optimization class, but basically no time for the actual implementation details. I'm trying to get an understanding of coordinate descent, which if I'm understanding it correctly utilizes the Gauss-Sediel method for solving a set of linear equations.
What I'm not getting is WHICH system of linear equations we're solving.
In the case of $f$ being convex, then $\text{min. } f(x) \Leftrightarrow \nabla f(x) = 0$, which leads to a set of equations, but they're only linear when $f$ is quadratic... It gets even worse if we're relaxing some constraint, in particular a convex, separable but nondifferentiable constraint - here we can't lean on the gradient to produce a set of equations, let alone a linear set of equations.
So where does the linear set of equation solving come into the picture? What are the actual computational steps for coordinate descent?
What is the Gradient Descent Method?
Let's generalize it by the Steepest Descent Method.
We'll define the Steepest Descent using the following optimization problem:
$$ \arg \min_{ \left\| d \right\|_{p} \leq 1} \left\{ {\nabla f \left( x \right)}^{T} d \right\} $$
If we chose $ p = 2 $ you will get the known Gradient Descent.
The interesting thing happens when you chose $ p = 1 $.
Then the optimal direction (Easily derived form the KKT) is given by:
$$ {d}_{i} = \begin{cases} -\operatorname{sign} \left( {\nabla f \left( x \right)}_{i} \right) & \text{ if } \left| {\nabla f \left( x \right)}_{i} \right| = \max_{i} \left\{ \left| {\nabla f \left( x \right)}_{i} \right| \right\} \\ 0 & \text{ if } \left| {\nabla f \left( x \right)}_{i} \right| \neq \max_{i} \left\{ \left| {\nabla f \left( x \right)}_{i} \right| \right\} \end{cases} $$
Namely the Steepest Descent in the $ {L}_{1} $ Norm sense is taking the direction which is aligned to coordinate with the steepest slope.
This is basically Coordinate Descent, solve the problem in a direction aligned with single coordinate.
So you can think of Coordinate Descents something like Steepest Decedent in $ {L}_{1} $ Norm.