Regularized least squares solution norm bounded by least squares solution norm


I was reading a book on numerical methods and came across this exercise:

Consider the least squares problem in Euclidean norm: $$\min_x \|b-Ax\|_2,$$ where we know that $A$ is ill-conditioned. Let's replace the normal equation with a better-conditioned system $$(A^TA + \gamma I)x_{\gamma} = A^Tb,$$ where $\gamma > 0$ parameter and $I$ is the appropriate size identity matrix.

The exercise is to show that $$\|x_{\gamma}\| \leq \|x\|$$

How can I show this?

My idea was to write $x = (A^TA)^{-1}A^Tb$ and $x_{\gamma} = (A^TA + \gamma I)^{-1}A^Tb$, and I tried to give an upper bound on $\|x_{\gamma}\|$ by rewriting the problem as $$\frac{1}{\gamma}\|(N + I)^{-1}\| \leq \|N^{-1}\|$$
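As a quick sanity check before any proof, the claimed inequality can be verified numerically on a random instance (a sketch in NumPy; the matrix size, right-hand side, and value of $\gamma$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # random tall matrix, full column rank a.s.
b = rng.standard_normal(20)
gamma = 0.5

# Unregularized least squares solution via the normal equations.
x = np.linalg.solve(A.T @ A, A.T @ b)
# Tikhonov-regularized solution.
x_gamma = np.linalg.solve(A.T @ A + gamma * np.eye(5), A.T @ b)

assert np.linalg.norm(x_gamma) <= np.linalg.norm(x)
```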


There are 3 answers below.

Best answer:

Assume that $A^T A$ is invertible. Then by the spectral theorem, for any unit vector $y$, $\| (A^T A + \gamma I)^{-1} y \|^2 = \sum_i c_i(y)^2 (\lambda_i+\gamma)^{-2}$ and $\| (A^T A)^{-1} y \|^2 = \sum_i c_i(y)^2 \lambda_i^{-2}$, where $c_i(y)=q_i^T y$ and the $q_i$ are orthonormal eigenvectors of $A^T A$ with eigenvalues $\lambda_i > 0$. Since $\gamma > 0$, each term satisfies $(\lambda_i+\gamma)^{-2} \leq \lambda_i^{-2}$, so the first sum is bounded by the second; applying this with $y = A^Tb/\|A^Tb\|$ gives the result. The orthogonality is crucial here.
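The eigen-expansion used here is easy to check numerically (a sketch; $A^TA$ is invertible almost surely for the random matrix used below):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((9, 4))
gamma = 0.4
AtA = A.T @ A

lam, Q = np.linalg.eigh(AtA)      # eigenvalues and orthonormal eigenvectors
y = rng.standard_normal(4)
y /= np.linalg.norm(y)            # arbitrary unit vector
c = Q.T @ y                       # coefficients c_i(y) = q_i^T y

lhs = np.sum(c**2 / (lam + gamma)**2)   # ||(A^T A + gamma I)^{-1} y||^2
rhs = np.sum(c**2 / lam**2)             # ||(A^T A)^{-1} y||^2

# The expansion matches a direct computation, and the termwise bound holds.
assert np.isclose(lhs, np.linalg.norm(np.linalg.solve(AtA + gamma * np.eye(4), y))**2)
assert lhs <= rhs
```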

When $A^T A$ is not invertible you have a problem: if "inverse" means "Moore-Penrose pseudoinverse", then the zero eigenvalues of $A^T A$ contribute to $\| (A^T A + \gamma I)^{-1} y \|^2$ but not to $\| (A^T A)^{-1} y \|^2$. Maybe the method can be adapted to this case, but you'll need some auxiliary conditions somewhere. (For example, the result clearly fails if $A=0$ and we interpret $(A^T A)^{-1}$ as the Moore-Penrose pseudoinverse.)
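The degenerate case mentioned here can be made concrete: for $A = 0$, the operator-level comparison fails when $(A^T A)^{-1}$ is read as the Moore-Penrose pseudoinverse (a minimal sketch; the unit vector $y$ is an arbitrary choice):

```python
import numpy as np

gamma = 0.5
n = 3
AtA = np.zeros((n, n))          # A = 0, so A^T A = 0 (singular)
y = np.array([1.0, 0.0, 0.0])   # arbitrary unit vector

# (A^T A + gamma I)^{-1} y = y / gamma, norm 1/gamma ...
lhs = np.linalg.norm(np.linalg.solve(AtA + gamma * np.eye(n), y))
# ... while the pseudoinverse of the zero matrix annihilates y.
rhs = np.linalg.norm(np.linalg.pinv(AtA) @ y)

assert lhs > rhs  # the operator-level comparison fails in the singular case
```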

Answer:

Unfortunately, I cannot provide a clear proof of your statement in the general case where $A^TA$ is not invertible, but I can give you some intuition and point you in the right direction.

This Tikhonov penalization is called "ridge regression" in statistics. There it is well known that your solution can be seen as the solution of the constrained minimization problem $$\min_{x\in \mathbb{R}^p}||b-Ax||_2^2, \quad s.t. \sum_{i=1}^px_i^2 \leq t,$$ i.e. $(A^TA+\gamma I)^{-1}A^Tb$ is the solution of the corresponding penalized (Lagrangian) problem: $$\min_{x\in \mathbb{R}^p}||b-Ax||_2^2 + \gamma \sum_{i=1}^px_i^2.$$
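A standard way to see the penalized problem as an ordinary least squares problem is to stack $A$ with $\sqrt{\gamma}\,I$; a small numerical sketch (random data, arbitrary $\gamma$):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)
gamma = 2.0

# Closed-form ridge solution.
x_ridge = np.linalg.solve(A.T @ A + gamma * np.eye(4), A.T @ b)

# Same problem posed as ordinary least squares on a stacked system:
# min ||b - Ax||^2 + gamma ||x||^2 = min ||[b; 0] - [A; sqrt(gamma) I] x||^2
A_aug = np.vstack([A, np.sqrt(gamma) * np.eye(4)])
b_aug = np.concatenate([b, np.zeros(4)])
x_stacked, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

assert np.allclose(x_ridge, x_stacked)
```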

It is then clear that you are solving a constrained minimization problem, and hence for its solution $x_{\gamma}$ you get $\sum_{i=1}^p x_{\gamma,i}^2\leq \sum_{i=1}^p x_{\gamma=0,i}^2$, where $x_{\gamma=0}$ is the solution of the unconstrained optimization problem.

Assume $Z := A^TA$ to be invertible (which holds if $A$ has full column rank). You can then write: \begin{align*} x_{\gamma} & = (A^TA+\gamma I)^{-1}A^Tb\\ & = (Z+\gamma I)^{-1}Z(Z^{-1}A^Tb)\\ & = \big(Z(I+\gamma Z^{-1})\big)^{-1}Z\,x_{\gamma=0}\\ & = (I+\gamma Z^{-1})^{-1}Z^{-1}Z\,x_{\gamma=0}\\ & = (I+\gamma Z^{-1})^{-1}x_{\gamma=0} \\ & = \big(I-\gamma Z^{-1}(I+\gamma Z^{-1})^{-1}\big)x_{\gamma=0}\\ & = x_{\gamma=0} -\gamma Z^{-1}(I+\gamma Z^{-1})^{-1}x_{\gamma=0}, \end{align*} where the second-to-last step uses the identity $(I+M)^{-1} = I - M(I+M)^{-1}$ with $M = \gamma Z^{-1}$.
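The chain of identities can be checked numerically (a sketch; the random matrix below has full column rank almost surely):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((12, 4))
b = rng.standard_normal(12)
gamma = 0.7
Z = A.T @ A

x0 = np.linalg.solve(Z, A.T @ b)                          # x_{gamma=0}
x_gamma = np.linalg.solve(Z + gamma * np.eye(4), A.T @ b)

Zinv = np.linalg.inv(Z)
# x_gamma = (I + gamma Z^{-1})^{-1} x0 ...
form1 = np.linalg.solve(np.eye(4) + gamma * Zinv, x0)
# ... = x0 - gamma Z^{-1} (I + gamma Z^{-1})^{-1} x0
form2 = x0 - gamma * Zinv @ np.linalg.solve(np.eye(4) + gamma * Zinv, x0)

assert np.allclose(x_gamma, form1)
assert np.allclose(x_gamma, form2)
```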

Note: assuming $Z=I$ we then have as a special case:

$$x_{\gamma} = \frac{1}{1+\gamma}x_{\gamma=0}.$$

Edit: something was wrong here; the assertion does not directly follow from the triangle inequality as it was applied here.

Edit 2: Use the SVD of $A$, i.e. $A=UDV^T$, and let $\delta_i$ be the $i$-th singular value of $A$. It then follows from the discussion above that $$x_{\gamma} = VD^*V^Tx_{\gamma=0} = x_{\gamma=0} - VD^{**}V^T x_{\gamma=0},$$ where $D^*$ and $D^{**}$ are diagonal matrices with entries $\frac{\delta_i^2}{\delta_i^2+\gamma}$ and $\frac{1}{\delta_i^2/\gamma +1}$ respectively. While this representation is enough to get a feeling for the shrinkage towards $0$ (if $\gamma\to \infty$, then obviously $VD^*V^T \to 0$ and $VD^{**}V^T \to I$), some more detailed calculation is still needed to compare the norms. Ian's approach seems to be a straightforward way to go.
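The SVD representation $x_\gamma = VD^*V^T x_{\gamma=0}$ can likewise be verified numerically (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((15, 4))
b = rng.standard_normal(15)
gamma = 1.3

U, d, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(d) V^T
V = Vt.T
Z = A.T @ A

x0 = np.linalg.solve(Z, A.T @ b)                          # x_{gamma=0}
x_gamma = np.linalg.solve(Z + gamma * np.eye(4), A.T @ b)

# D* has entries delta_i^2 / (delta_i^2 + gamma).
Dstar = np.diag(d**2 / (d**2 + gamma))
assert np.allclose(x_gamma, V @ Dstar @ Vt @ x0)
```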

Edit 3: Using matrix norms, the assertion also follows immediately from the calculations above:

Note that the eigenvalues of the symmetric matrix $B:=VD^*V^T$ are $\lambda_i :=\frac{\delta_i^2}{\delta_i^2 + \gamma}$, where $\lambda_i \leq 1$ since $\gamma\geq 0$. Moreover, let $||B||$ denote the spectral norm of the matrix $B$.

We then have $$||B|| = \sqrt{\lambda_{max}(B^TB)} = \max_{i}|\lambda_i|\leq 1.$$

Now let $||x||_v$ be a compatible vector norm (e.g. the Euclidean norm) and recall that we found $x_\gamma = B x_{\gamma=0}$ in the calculations above. We then have:

$$||x_\gamma||_v = ||B x_{\gamma=0}||_v \leq ||B|| \cdot ||x_{\gamma=0}||_v \leq ||x_{\gamma=0}||_v.$$

This proves the assertion. It follows from the steps above that the inequality is strict for all $\gamma>0$, provided $x_{\gamma=0} \neq 0$.

Answer:

This answer is quite late, but I hope it will be useful to someone else.

Let $x = \arg\min_z {\lVert Az-b \rVert_2} $, then $A^TAx = A^Tb$ by the optimality condition. We have $x_\gamma = (A^TA + \gamma I)^{-1} A^Tb = (A^TA + \gamma I)^{-1}A^TAx$. Then $$\lVert x_\gamma \rVert_2 \le \lVert (A^TA + \gamma I)^{-1} A^TA \rVert_2 \lVert x \rVert_2.$$ It suffices to show that $\lVert (A^TA + \gamma I)^{-1} A^TA \rVert_2 \le 1$.

Let $A = U \Sigma V^T$ be the SVD of $A$, where $U$ and $V$ are orthogonal and $\Sigma = \text{diag}(\sigma_1,\sigma_2, \ldots, \sigma_k, 0, \ldots, 0)$ is diagonal ($k$ is the rank of $A$ and the $\sigma_i > 0$ are its singular values). Then $A^TA = V \Sigma^2V^T$.

Hence, \begin{align} (A^TA + \gamma I)^{-1} A^TA &= \big(V (\Sigma^2 + \gamma I) V^T\big)^{-1} V \Sigma^2V^T \\ &= V(\Sigma^2 + \gamma I)^{-1} V^T V \Sigma^2V^T \\ &= V(\Sigma^2 + \gamma I)^{-1} \Sigma^2 V^T. \end{align} Since the 2-norm is invariant under multiplication by orthogonal matrices and $\gamma > 0$, then \begin{align} \lVert (A^TA + \gamma I)^{-1} A^TA \rVert_2 &= \lVert (\Sigma^2 + \gamma I)^{-1} \Sigma^2 \rVert_2 \\ &= \left\lVert \text{diag}\left(\frac{{\sigma_1}^2}{{\sigma_1}^2+\gamma},\frac{{\sigma_2}^2}{{\sigma_2}^2+\gamma}, \ldots, \frac{{\sigma_k}^2}{{\sigma_k}^2+\gamma}, 0, \ldots, 0 \right) \right\rVert_2 \\ &= \max_{1 \le i \le k} \left( \frac{{\sigma_i}^2}{{\sigma_i}^2+\gamma} \right) \\ & \le 1, \end{align} which completes the proof that $\lVert x_\gamma \rVert_2 \le \lVert x \rVert_2$.
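The final norm computation can be confirmed numerically: the spectral norm of $(A^TA + \gamma I)^{-1} A^TA$ matches $\max_i \sigma_i^2/(\sigma_i^2+\gamma)$ and stays below $1$ (a sketch with a random full-rank $A$):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))   # full column rank a.s., so all sigma_i > 0
gamma = 0.9

M = np.linalg.solve(A.T @ A + gamma * np.eye(5), A.T @ A)
sigma = np.linalg.svd(A, compute_uv=False)

spec = np.linalg.norm(M, 2)                        # spectral norm of M
predicted = np.max(sigma**2 / (sigma**2 + gamma))  # max_i sigma_i^2/(sigma_i^2+gamma)

assert np.isclose(spec, predicted)
assert spec <= 1.0
```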