In gradient descent on a quadratic problem,
$$\min _{x \in \mathbb{R}^n}\frac12 \langle Hx, x \rangle + \langle b, x \rangle + c \qquad H \text{ symmetric, positive semi-definite}\quad b \in \operatorname{Range}(H)$$
the condition number of the Hessian, $\kappa(H)$, is related to the convergence rate via $\frac{\kappa - 1}{\kappa + 1}$. Suppose $H$ is singular and positive semi-definite. Then $\kappa(H)$, depending on the definition of $\kappa$, might be infinite. But I think in this case I should be concerned only with the effective condition number $\tilde{\kappa}(H) = \frac{\lambda_1(H)}{\lambda_k(H)}$, where $\lambda_k$ is the smallest non-zero eigenvalue of $H$ (eigenvalues in decreasing order, so $k = \operatorname{Rank}(H)$), and I shouldn't worry about the nullspace of $H$.
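To make the definition concrete, here's a quick sketch (the construction and variable names are my own toy example): build a rank-2 PSD matrix in $\mathbb{R}^3$ with known eigenvalues $3, 1, 0$, then recover $\tilde{\kappa}$ by discarding the numerically zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal basis
H = Q @ np.diag([3.0, 1.0, 0.0]) @ Q.T            # rank-2 PSD matrix

lam = np.linalg.eigvalsh(H)[::-1]        # eigenvalues, decreasing order
nonzero = lam[lam > 1e-10 * lam[0]]      # drop (numerically) zero eigenvalues
kappa_tilde = nonzero[0] / nonzero[-1]   # lambda_1 / lambda_k = 3, while kappa(H) is infinite
```

The relative tolerance `1e-10 * lam[0]` is an arbitrary choice here; in practice deciding which eigenvalues count as "zero" is itself a judgment call.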
My logic is: since $b \in \operatorname{Range}(H) = \operatorname{Nullspace}(H)^\perp$, I can really treat this problem as optimization on $\operatorname{Range}(H) \subset \mathbb{R}^n$, where $H|_{\operatorname{Range}(H)}$ is nonsingular, right? A component of $x_k - x_{k-1}$ along $\operatorname{Nullspace}(H)$ would neither affect the objective function nor show up in the gradient $Hx + b \in \operatorname{Range}(H)$, so it should never appear in gradient descent in the first place (when we're dealing with a quadratic that attains its minimum)?
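The claim is easy to test numerically (again, the specific matrices are my own toy setup): run plain gradient descent on a singular quadratic with $b \in \operatorname{Range}(H)$, starting in the range, and watch the iterate's nullspace component.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = Q @ np.diag([2.0, 1.0, 0.0, 0.0]) @ Q.T  # rank-2 PSD Hessian
N = Q[:, 2:]                                  # orthonormal basis of Nullspace(H)
b = H @ rng.standard_normal(4)                # forces b into Range(H)

x = np.zeros(4)                               # start with no nullspace component
for _ in range(300):
    x = x - 0.5 * (H @ x + b)                 # step size 1/lambda_1

null_part = np.linalg.norm(N.T @ x)           # ~0, up to rounding error
residual = np.linalg.norm(H @ x + b)          # gradient norm: converged
```

The nullspace component stays at the level of floating-point noise, consistent with the argument above.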
Now suppose I'm right about the above. Suppose we're in a more complicated problem (say a machine learning problem), and we're looking at $\kappa(H)$ (which now depends on $x$) as a signal for how bad our optimization problem is. (We hope it indicates whether we'll encounter narrow valleys, etc.) Why might I, or might I not, be justified in considering $\tilde{\kappa}(H)$ as opposed to $\kappa(H)$, when the latter is possibly infinite?
Some ideas:
- In a general optimization problem, it's impossible to tell from merely local information whether we're justified in ignoring a certain subspace in our gradient search for an optimum.
- In practice, we'll be dealing with computer error and inexact quantities, so there really isn't much difference between problems where $H$ is non-invertible, and problems where it is very poorly conditioned.
- Some algorithms, such as Newton's method (and quasi-Newton methods like BFGS, which maintain an invertible approximation), need to solve against the Hessian to come up with their next step, so singularity is a problem.
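On the second idea, a tiny experiment (the matrices are my own toy choice) makes it concrete: an exactly singular $H$ and an $H$ with a $10^{-14}$ eigenvalue produce gradient-descent iterates that are numerically indistinguishable, even though the ill-conditioned problem has a finite minimizer along the weak direction and the singular one does not.

```python
import numpy as np

eps = 1e-14
H_sing = np.diag([1.0, 0.0])   # exactly singular
H_ill = np.diag([1.0, eps])    # invertible, but kappa(H) = 1e14
b = np.array([-1.0, -eps])     # not exactly in Range(H_sing), but
                               # numerically indistinguishable from it

def gd(H, b, iters=1000, step=1.0):
    x = np.zeros(2)
    for _ in range(iters):
        x = x - step * (H @ x + b)
    return x

x_sing = gd(H_sing, b)
x_ill = gd(H_ill, b)
gap = np.linalg.norm(x_sing - x_ill)
# The ill-conditioned problem's minimizer is (1, 1), yet after 1000 steps
# its iterate sits near (1, 1e-11): essentially identical to the singular run.
```

So from the iterates alone you can't tell the two problems apart, which is exactly why "exactly singular" versus "terribly conditioned" is a distinction without much practical difference.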
What other considerations might I want to be aware of?