Let us assume that we are trying to solve $\min_x f(x)$, $f$ differentiable.
Computing a full gradient $\nabla f$ is more expensive than computing the gradient along a single coordinate, $\nabla_i f$.
Is there any general theory that says that, say, for a convex function $f$, gradient descent is always better than coordinate descent? If not, why not always use coordinate descent, since it is cheaper? To be specific, I am comparing
$x^{t+1} = x^{t } - \alpha \nabla f(x^t)$
and
$x^{t+1}_i = x^{t}_i - \alpha \nabla_i f(x^t)$, where in each iteration $i$ is chosen cyclically (sequentially, wrapping around), or at random, or by some other scheme.
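To make the comparison concrete, here is a minimal Python sketch of the two schemes (the function names and the quadratic used for illustration are my own choices, not anything standard):

```python
import numpy as np

def gradient_descent(grad, x0, alpha, iters):
    # Full-gradient update: x <- x - alpha * grad(x)
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

def cyclic_coordinate_descent(grad, x0, alpha, iters):
    # Single-coordinate update: x_i <- x_i - alpha * (grad(x))_i,
    # with i chosen cyclically (i = t mod n).
    # For simplicity the full gradient is evaluated and indexed here;
    # in practice only the i-th partial derivative would be computed.
    x = x0.copy()
    n = len(x)
    for t in range(iters):
        i = t % n
        x[i] = x[i] - alpha * grad(x)[i]
    return x

# Illustration on the quadratic f(x1, x2) = x1^2/2 + (kappa/2) * x2^2
kappa = 2.0
grad = lambda x: np.array([x[0], kappa * x[1]])
x0 = np.array([1.0, 1.0])
print(gradient_descent(grad, x0, alpha=0.5, iters=2))
print(cyclic_coordinate_descent(grad, x0, alpha=0.5, iters=2))
```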
For example, if $f(x_1,x_2) = \frac{1}{2}x_1^2 + \frac{\kappa}{2}x_2^2$ with $\kappa \geq 1$, then the best possible per-iteration rate of convergence for gradient descent is $\min_{\alpha} \max\{|1 - \alpha|,|1 - \alpha\kappa |\} = \frac{\kappa-1}{\kappa + 1}$. Over two iterations, the rate is $\left( \frac{\kappa-1}{\kappa + 1}\right)^2$.
If, however, I do coordinate descent, alternating between descent in $x_1$ and descent in $x_2$, then the best possible rate of convergence over two iterations is $\min_{\alpha} \max\{|1 - \alpha|,|1 - \alpha\kappa + \alpha^2 \kappa|\}$, which, for certain values of $\kappa$, is smaller than $\left( \frac{\kappa-1}{\kappa + 1}\right)^2$.
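A quick brute-force check of these two expressions over a grid of $\alpha$ values (a sketch; the grid range and the sample values of $\kappa$ are arbitrary choices):

```python
import numpy as np

def gd_two_step_rate(kappa, alphas):
    # min over alpha of the two-iteration gradient-descent rate
    # max{|1 - a|, |1 - a*kappa|}^2
    r = np.maximum(np.abs(1 - alphas), np.abs(1 - alphas * kappa)) ** 2
    return r.min()

def cd_two_step_rate(kappa, alphas):
    # min over alpha of the two-iteration coordinate-descent rate
    # max{|1 - a|, |1 - a*kappa + a^2*kappa|}
    r = np.maximum(np.abs(1 - alphas),
                   np.abs(1 - alphas * kappa + alphas**2 * kappa))
    return r.min()

alphas = np.linspace(0.0, 2.0, 20001)
for kappa in [1.5, 2.0, 3.0, 3.4, 4.0, 10.0]:
    gd = gd_two_step_rate(kappa, alphas)
    cd = cd_two_step_rate(kappa, alphas)
    print(f"kappa={kappa:5.1f}  GD={gd:.4f}  CD={cd:.4f}  CD better: {cd < gd}")
```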
So it seems that, even in simple examples, it is not clear that the more expensive gradient descent is better. Why use it then?
Here are some plots of rate vs. $\alpha$ (blue is coordinate descent), for $\kappa < 3.4$ and for $\kappa > 3.4$:
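For reference, rate-vs-$\alpha$ curves like these can be generated with a short matplotlib sketch (the color used for gradient descent and the specific values of $\kappa$ are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

alphas = np.linspace(0.0, 2.0, 1000)

def plot_rates(kappa, ax):
    # Two-iteration gradient-descent rate: max{|1-a|, |1-a*kappa|}^2
    gd = np.maximum(np.abs(1 - alphas), np.abs(1 - alphas * kappa)) ** 2
    # Two-iteration coordinate-descent rate: max{|1-a|, |1-a*kappa+a^2*kappa|}
    cd = np.maximum(np.abs(1 - alphas),
                    np.abs(1 - alphas * kappa + alphas**2 * kappa))
    ax.plot(alphas, gd, color="red", label="gradient descent")
    ax.plot(alphas, cd, color="blue", label="coordinate descent")
    ax.set_title(f"kappa = {kappa}")
    ax.set_xlabel("alpha")
    ax.set_ylabel("rate over two iterations")
    ax.set_ylim(0, 1)
    ax.legend()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_rates(3.0, ax1)   # kappa < 3.4
plot_rates(4.0, ax2)   # kappa > 3.4
plt.tight_layout()
plt.show()
```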