Least squares coordinate-wise smoothness


What is the coordinate-wise smoothness constant $L_i$ for the least squares objective $$f(\mathbf{x}):=\frac{1}{2}\|A \mathbf{x}-\mathbf{b}\|^{2}?$$

We define coordinate-wise smoothness as: $$f\left(\mathbf{x}+\gamma \mathbf{e}_{i}\right) \leq f(\mathbf{x})+\gamma \nabla_{i} f(\mathbf{x})+\frac{L_i}{2} \gamma^{2} \quad \forall \mathbf{x} \in \mathbb{R}^{n}, \forall \gamma \in \mathbb{R},$$ where $A$ is an $m \times n$ matrix with columns $\mathbf{a}_i$, i.e. $A=\left[\mathbf{a}_{1}, \ldots, \mathbf{a}_{n}\right]$, and $\mathbf{e}_i \in \mathbb{R}^n$ is the $i$-th standard basis vector (the vector whose $i$-th coordinate is $1$ and all others are $0$). I seem to fail to get any of the possible answers. I plugged the objective into the definition of coordinate-wise smoothness; my steps:

$$\|A(\mathbf{x}+\gamma \mathbf{e}_i)-\mathbf{b}\| = \|A\mathbf{x} + \gamma A\mathbf{e}_i - \mathbf{b}\| = \|A\mathbf{x}-\mathbf{b} + \gamma \mathbf{a}_i\|$$ by the triangle inequality: $$\|A\mathbf{x}-\mathbf{b} + \gamma \mathbf{a}_i\| \leq \|A\mathbf{x}-\mathbf{b}\| + |\gamma|\,\|\mathbf{a}_i\|$$ squaring both sides: $$\|A\mathbf{x}-\mathbf{b} + \gamma \mathbf{a}_i\|^2 \leq \|A\mathbf{x}-\mathbf{b}\|^2 + \gamma^2\|\mathbf{a}_i\|^2 + 2|\gamma|\,\|\mathbf{a}_i\|\,\|A\mathbf{x}-\mathbf{b}\|$$ $$\frac{1}{2}\|A\mathbf{x}-\mathbf{b} + \gamma \mathbf{a}_i\|^2 \leq \frac{1}{2}\|A\mathbf{x}-\mathbf{b}\|^2 + \frac{\gamma^2}{2}\|\mathbf{a}_i\|^2 + |\gamma|\,\|\mathbf{a}_i\|\,\|A\mathbf{x}-\mathbf{b}\|$$

so it seems like $L_i$ could be $\|\mathbf{a}_i\|^2$, but the last term does not match the $i$-th coordinate of the gradient, so in my opinion the definition above doesn't hold. On the other hand, I found in another textbook that the general smoothness constant for least squares is $\lambda_{\max}(A^TA)$. Any guidance on how to proceed, or a hint about what I am overlooking, would be greatly appreciated. Thanks!

1 Answer

The smoothness parameter w.r.t. the $\ell_2$ norm is $\lambda_{\max}(A^T A)$.

The reason is that $f$ is $L$-smooth w.r.t. the $\ell_p$ norm if and only if $\|\nabla^2 f(\mathbf{x})\|_{p,q}\leq L$ for all $\mathbf{x}$, where $p,q \geq 1$ are conjugate exponents, $\frac{1}{p}+\frac{1}{q}=1$. The proof of this theorem is not very complicated and you can easily find it online.
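As a quick numerical sanity check (not part of the original answer; the matrix sizes and random seed below are arbitrary), one can verify with NumPy that the quadratic upper bound $f(\mathbf{y}) \le f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^2$ holds globally with $L = \lambda_{\max}(A^TA)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)  # gradient of the least squares objective

# Smoothness constant w.r.t. the l2 norm: largest eigenvalue of the Hessian A^T A.
L = np.linalg.eigvalsh(A.T @ A).max()

# Check f(y) <= f(x) + <grad f(x), y - x> + (L/2)||y - x||^2 on random pairs (x, y).
for _ in range(1000):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert f(y) <= upper + 1e-9
```

Note that $\lambda_{\max}(A^TA) = \sigma_{\max}(A)^2$, the squared spectral norm of $A$.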

Since in your case $\nabla^2 f(\mathbf{x}) = A^TA$, the result follows immediately. If you wish to look at it "component-wise", restrict the quadratic form to the $i$-th coordinate: $L_i = \mathbf{e}_i^T A^T A \mathbf{e}_i = \mathbf{a}_i^T \mathbf{a}_i = \|\mathbf{a}_i\|^2$, and clearly $L_i\leq L$ for all $i$. This also shows where your derivation went astray: the triangle inequality is loose. Expanding the square exactly, $$f(\mathbf{x}+\gamma \mathbf{e}_i) = f(\mathbf{x}) + \gamma\, \mathbf{a}_i^T(A\mathbf{x}-\mathbf{b}) + \frac{\gamma^2}{2}\|\mathbf{a}_i\|^2,$$ and since $\nabla_i f(\mathbf{x}) = \mathbf{a}_i^T(A\mathbf{x}-\mathbf{b})$, the coordinate-wise definition holds with equality for $L_i = \|\mathbf{a}_i\|^2$.
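The coordinate-wise claim can be checked numerically the same way (again just an illustrative sketch with arbitrary dimensions and seed): the gap $f(\mathbf{x}+\gamma\mathbf{e}_i) - f(\mathbf{x}) - \gamma\nabla_i f(\mathbf{x})$ should equal $\frac{\gamma^2}{2}\|\mathbf{a}_i\|^2$ exactly, since the objective is an exact quadratic in $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

for i in range(n):
    L_i = np.linalg.norm(A[:, i]) ** 2  # L_i = ||a_i||^2
    for _ in range(100):
        x = rng.standard_normal(n)
        gamma = rng.standard_normal()
        e_i = np.zeros(n)
        e_i[i] = 1.0
        gap = f(x + gamma * e_i) - f(x) - gamma * grad(x)[i]
        # The remainder is exactly (gamma^2 / 2) * ||a_i||^2, so the
        # coordinate-wise bound holds with equality.
        assert np.isclose(gap, 0.5 * L_i * gamma ** 2)
```

Each $L_i$ is also bounded by the global constant, since the diagonal entries of $A^TA$ never exceed its largest eigenvalue.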