Consider running gradient descent (GD) on the following optimization problem:
$$\arg\min_{\mathbf x \in \mathbb R^n} \| A\mathbf x-\mathbf b \|_2^2$$
where $\mathbf b$ lies in the column space of $A$ and the columns of $A$ are not linearly independent. Is it true that GD finds the minimum-norm solution? I saw some articles (e.g., arXiv:1705.09280) that suggest so, but after searching the internet for a while I couldn't find a proof.
Can someone confirm or refute it? And if it's true, a proof or a reference to the proof would be much appreciated!
EDIT 2019/11/27:
Thanks to littleO's answer, the answer to this question is apparently no in general. However, I'm still curious about the following:
Follow-up Question: Are there some constraints under which the answer is yes? Is it true that, as Clement C. suggested, if we initialize $\mathbf x$ in the range of $A^\top$, then GD finds the minimum-norm solution? Is this a sufficient condition or is it also necessary?
It appears to me that the answer is yes if and only if we initialize $\mathbf x$ in the range of $A^\top$.
I'll list my arguments below and would appreciate it if someone would confirm it or point out where I'm mistaken.
My arguments: Let $f(\mathbf x)= \| A\mathbf x-\mathbf b \|_2^2$. Then $\nabla_{\mathbf x}f(\mathbf x) = 2A^\top(A\mathbf x - \mathbf b),$ and GD iterates as follows: $\mathbf x^{(t+1)}=\mathbf x^{(t)}-\eta \nabla_{\mathbf x}f(\mathbf x^{(t)})$. Note that all GD updates are in the range of $A^\top$. Hence we may write $\mathbf x^{(t)}=\mathbf x^{(0)}+A^\top \mathbf u$ for some vector $\mathbf u$.
Sufficiency: Suppose $\mathbf x^{(0)}$ is also in the range of $A^\top$, i.e. $\mathbf x^{(0)}=A^\top \mathbf v$. Then $\mathbf x^{(t)}=A^\top (\mathbf v+\mathbf u).$ Since $f(\mathbf x)$ is convex, we know that GD converges to a global minimum (with value $0$) if the step size is small enough. Denote the limit by $\mathbf x^{(t)} \to \mathbf x^* = A^\top \mathbf u^*$. Hence $A\mathbf x^*-\mathbf b=AA^\top \mathbf u^*-\mathbf b=\mathbf 0$, so $\mathbf u^*=(AA^\top)^{-1}\mathbf b$ (assuming $A$ has full row rank, so that $AA^\top$ is invertible), and $\mathbf x^*=A^\top (AA^\top)^{-1}\mathbf b$, which is the well-known minimum-norm solution. (If $A$ does not have full row rank, we can delete redundant rows.)
Necessity: Now suppose $\mathbf x^{(0)} \notin \mathrm{range}(A^\top)$, and $\mathbf x^{(t)} \to \mathbf x^*$. We necessarily have $\mathbf x^* = A^\top \mathbf u^* + \mathbf x^{(0)}$ for some $\mathbf u^*$. Since every GD update lies in $\mathrm{range}(A^\top)$, the component of $\mathbf x^{(0)}$ orthogonal to $\mathrm{range}(A^\top)$ is preserved throughout, so $\mathbf x^*\notin \mathrm{range}(A^\top)$, and hence it cannot be the (unique) minimum-norm solution $A^\top (AA^\top)^{-1}\mathbf b$.
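For what it's worth, here is a minimal numerical sketch of this claim (my own check, not part of the argument above; the dimensions, random seed, step size, and iteration count are arbitrary illustrative choices). GD initialized in $\mathrm{range}(A^\top)$ should land on the minimum-norm solution, while a generic initialization should converge to some other solution of $A\mathbf x=\mathbf b$:

```python
# Numerical sanity check (illustrative sketch): compare GD limits with the
# minimum-norm solution pinv(A) @ b for a rank-deficient (fat) A.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))          # columns are not linearly independent
b = A @ rng.standard_normal(6)           # b lies in the column space of A by construction
x_min_norm = np.linalg.pinv(A) @ b       # A^T (A A^T)^{-1} b

def gd(x0, steps=100_000, eta=1e-2):
    """Plain gradient descent on f(x) = ||Ax - b||_2^2."""
    x = x0.copy()
    for _ in range(steps):
        x -= eta * 2 * A.T @ (A @ x - b)  # gradient is 2 A^T (A x - b)
    return x

x_in_range = gd(A.T @ rng.standard_normal(3))   # x0 in range(A^T)
x_generic  = gd(rng.standard_normal(6))         # arbitrary x0

print(np.linalg.norm(A @ x_generic - b))        # ~0: still solves A x = b
print(np.linalg.norm(x_in_range - x_min_norm))  # ~0: recovers the min-norm solution
print(np.linalg.norm(x_generic - x_min_norm))   # typically far from 0
```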
From the paper [0] in question:
Given fat matrix $\mathrm A \in \mathbb R^{m \times n}$ ($m < n$) and vector $\mathrm b \in \mathbb R^m$, consider the following linear system in $\mathrm x \in \mathbb R^n$
$$\rm A x = b$$
where $\rm A$ has full row rank. Let the singular value decomposition (SVD) of $\rm A$ be as follows
$$\mathrm A = \mathrm U \Sigma \mathrm V^\top = \mathrm U \begin{bmatrix} \Sigma_1 & \mathrm O \end{bmatrix} \begin{bmatrix} \mathrm V_1^\top \\ \mathrm V_2^\top \end{bmatrix} = \mathrm U \Sigma_1 \mathrm V_1^\top$$
The least-norm solution of $\rm A x = b$ is given by
$$\mathrm x_{\text{LN}} := \mathrm A^\top \left( \mathrm A \mathrm A^\top \right)^{-1} \mathrm b = \cdots = \mathrm V_1 \Sigma_1^{-1} \mathrm U^\top \mathrm b$$
where the inverse of $\mathrm A \mathrm A^\top$ exists because $\rm A$ has full row rank.
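As a quick sanity check of the two expressions for $\mathrm x_{\text{LN}}$ (my own sketch, not from the original; the dimensions and random data are arbitrary), both should agree with each other and with `np.linalg.pinv(A) @ b`:

```python
# Check A^T (A A^T)^{-1} b == V_1 Sigma_1^{-1} U^T b == pinv(A) b for a random fat A.
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 7
A = rng.standard_normal((m, n))                   # fat, full row rank (almost surely)
b = rng.standard_normal(m)

x_ln = A.T @ np.linalg.solve(A @ A.T, b)          # A^T (A A^T)^{-1} b

U, s, Vt = np.linalg.svd(A)                       # A = U Sigma V^T (full SVD)
V1 = Vt[:m].T                                     # first m right singular vectors
x_svd = V1 @ np.diag(1.0 / s) @ U.T @ b           # V_1 Sigma_1^{-1} U^T b

print(np.allclose(x_ln, x_svd))                   # True
print(np.allclose(x_ln, np.linalg.pinv(A) @ b))   # True
```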
Gradient descent
Let cost function $f : \mathbb R^n \to \mathbb R$ be defined by
$$f (\mathrm x) := \frac12 \left\| \rm{A x - b} \right\|_2^2$$
whose gradient is
$$\nabla f (\mathrm x) = \rm A^\top \left( A x - b \right)$$
Using gradient descent with step $\mu > 0$,
$$\begin{aligned} {\rm x}_{k+1} &= {\rm x}_k - \mu \nabla f ({\rm x}_k)\\ &= \left( {\rm I} - \mu {\rm A^\top A} \right) {\rm x}_k + \mu {\rm A^\top b}\end{aligned}$$
Hence,
$${\rm x}_k = \left( {\rm I} - \mu {\rm A^\top A} \right)^k {\rm x}_0 + \mu \sum_{\ell = 0}^{k-1} \left( {\rm I} - \mu {\rm A^\top A} \right)^{\ell} {\rm A^\top b}$$
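This closed form is easy to verify numerically; here is a short sketch (my own, with arbitrary random data and an arbitrary step $\mu$) comparing $k$ explicit GD steps with the unrolled expression:

```python
# Verify the unrolled GD iterate against k explicit steps (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
m, n, k, mu = 3, 5, 50, 0.05
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = rng.standard_normal(n)

# Explicit iteration: x_{k+1} = (I - mu A^T A) x_k + mu A^T b
x = x0.copy()
for _ in range(k):
    x = x - mu * A.T @ (A @ x - b)

# Closed form: x_k = (I - mu A^T A)^k x0 + mu * sum_{l=0}^{k-1} (I - mu A^T A)^l A^T b
M = np.eye(n) - mu * A.T @ A
x_closed = np.linalg.matrix_power(M, k) @ x0 + mu * sum(
    np.linalg.matrix_power(M, l) for l in range(k)
) @ (A.T @ b)

print(np.allclose(x, x_closed))   # True
```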
Letting $\rm y := V^\top x$, we rewrite
$$\begin{aligned} {\rm y}_k &= \left( {\rm I} - \mu \Sigma^\top \Sigma \right)^k {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \left( {\rm I} - \mu \Sigma^\top \Sigma \right)^{\ell} \Sigma^\top {\rm U^\top b}\\ &= \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^k & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} \begin{bmatrix} \Sigma_1\\ \mathrm O \end{bmatrix} {\rm U^\top b}\\ &= \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^k & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{k-1} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1\\ \mathrm O\end{bmatrix} {\rm U^\top b} \end{aligned}$$
If $\mu > 0$ is chosen such that all eigenvalues of ${\rm I} - \mu \Sigma_1^2$ lie strictly inside the unit circle, then ${\rm y}_k \to {\rm y}_{\infty}$, where
$${\rm y}_{\infty} = \begin{bmatrix} \mathrm O & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \mu \sum_{\ell = 0}^{\infty} \begin{bmatrix} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1\\ \mathrm O\end{bmatrix} {\rm U^\top b}$$
where
$$\mu \sum_{\ell = 0}^{\infty} \left( {\rm I} - \mu \Sigma_1^2 \right)^{\ell} \Sigma_1 = \mu \left( {\rm I} - {\rm I} + \mu \Sigma_1^2 \right)^{-1} \Sigma_1 = \Sigma_1^{-1}$$
and, thus,
$${\rm y}_{\infty} = \begin{bmatrix} \mathrm O & \mathrm O\\ \mathrm O & \mathrm I\end{bmatrix} {\rm y}_0 + \begin{bmatrix} \Sigma_1^{-1} \\ \mathrm O\end{bmatrix} {\rm U^\top b}$$
Since $\rm x := V y$,
$$\boxed{\;{\rm x}_{\infty} = {\rm V}_2 {\rm V}_2^\top {\rm x}_0 + \underbrace{{\rm V}_1 \Sigma_1^{-1}{\rm U^\top b}}_{= \mathrm x_{\text{LN}}}\;}$$
Therefore, we conclude that if ${\rm x}_0$ is orthogonal to the null space of $\rm A$ (equivalently, ${\rm x}_0 \in \operatorname{range}({\rm A}^\top)$, so that ${\rm V}_2^\top {\rm x}_0 = \mathbf 0$), then gradient descent converges to the least-norm solution.
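A small numerical sketch of the boxed limit (my own check; the random data, step size, and iteration count are arbitrary illustrative choices): after many GD steps, $\mathrm x_k$ should be close to $\mathrm V_2 \mathrm V_2^\top \mathrm x_0 + \mathrm x_{\text{LN}}$.

```python
# Check x_infinity == V_2 V_2^T x_0 + V_1 Sigma_1^{-1} U^T b for a random fat A.
import numpy as np

rng = np.random.default_rng(3)
m, n, mu = 3, 6, 0.02
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x0 = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(A)                      # full SVD: Vt is n x n
V2 = Vt[m:].T                                    # orthonormal basis of null(A)
x_ln = Vt[:m].T @ np.diag(1.0 / s) @ U.T @ b     # V_1 Sigma_1^{-1} U^T b

x = x0.copy()
for _ in range(200_000):                         # x_{k+1} = x_k - mu A^T (A x_k - b)
    x = x - mu * A.T @ (A @ x - b)

print(np.linalg.norm(x - (V2 @ V2.T @ x0 + x_ln)))   # ~0 (up to convergence error)
```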
[0] Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro. Implicit Regularization in Matrix Factorization. arXiv:1705.09280, May 2017.