Moore-Penrose pseudoinverse and the Euclidean norm


Section 2.9 The Moore-Penrose Pseudoinverse of the textbook Deep Learning by Goodfellow, Bengio, and Courville, says the following:

Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse $\mathbf{B}$ of a matrix $\mathbf{A}$ so that we can solve a linear equation

$$\mathbf{A} \mathbf{x} = \mathbf{y} \tag{2.44}$$

by left-multiplying each side to obtain

$$\mathbf{x} = \mathbf{B} \mathbf{y}. \tag{2.45}$$

Depending on the structure of the problem, it may not be possible to design a unique mapping from $\mathbf{A}$ to $\mathbf{B}$.

If $\mathbf{A}$ is taller than it is wide, then it is possible for this equation to have no solution. If $\mathbf{A}$ is wider than it is tall, then there could be multiple possible solutions. The Moore-Penrose pseudoinverse enables us to make some headway in these cases. The pseudoinverse of $\mathbf{A}$ is defined as a matrix

$$\mathbf{A}^+ = \lim_{\alpha \searrow 0}(\mathbf{A}^T \mathbf{A} + \alpha \mathbf{I} )^{-1} \mathbf{A}^T. \tag{2.46}$$

Practical algorithms for computing the pseudoinverse are based not on this definition, but rather on the formula

$$\mathbf{A}^+ = \mathbf{V} \mathbf{D}^+ \mathbf{U}^T, \tag{2.47}$$

where $\mathbf{U}$, $\mathbf{D}$ and $\mathbf{V}$ are the singular value decomposition of $\mathbf{A}$, and the pseudoinverse $\mathbf{D}^+$ of a diagonal matrix $\mathbf{D}$ is obtained by taking the reciprocal of its nonzero elements then taking the transpose of the resulting matrix.

When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.

When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$.
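For concreteness, the SVD formula (2.47) can be checked numerically. A minimal NumPy sketch, using an arbitrary random matrix (not an example from the book):

```python
import numpy as np

# Minimal sketch of Eq. (2.47): build A+ from the SVD and compare it with
# np.linalg.pinv. The matrix A is an arbitrary example, not from the book.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))              # wider than tall

U, s, Vt = np.linalg.svd(A)                  # A = U @ D @ Vt
D_plus = np.zeros((A.shape[1], A.shape[0]))  # D+ has the transposed shape of D
D_plus[:len(s), :len(s)] = np.diag(1.0 / s)  # reciprocal of nonzero singular values

A_plus = Vt.T @ D_plus @ U.T                 # Eq. (2.47): A+ = V D+ U^T
print(np.allclose(A_plus, np.linalg.pinv(A)))
```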

It's this last part that I'm wondering about:

When $\mathbf{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions.

When $\mathbf{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$.

What I found confusing here is that the Euclidean norms $\vert \vert \mathbf{x} \vert \vert_2$ and $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$ seemingly come out of nowhere. Prior to this section, there is no discussion of the Euclidean norm -- only of the mechanics of the Moore-Penrose Pseudoinverse. And the authors then just assert this part without explanation.

So I am left wondering the following:

  1. Why is it that, when $\mathbf{A}$ has more columns than rows, then using the pseudoinverse gives us the solution $\mathbf{x} = \mathbf{A}^+ \mathbf{y}$ with minimal Euclidean norm $\vert \vert \mathbf{x} \vert \vert_2$ among all possible solutions?

  2. Why is it that, when $\mathbf{A}$ has more rows than columns, then using the pseudoinverse gives us the $\mathbf{x}$ for which $\mathbf{A} \mathbf{x}$ is as close as possible to $\mathbf{y}$ in terms of Euclidean norm $\vert \vert \mathbf{A} \mathbf{x} − \mathbf{y} \vert \vert_2$?

And what are the mechanics involved here?

I would greatly appreciate it if people would please take the time to clarify this.


There are 3 best solutions below

---

Eqn. (2.46) proposes to look at the minimizer $x_\alpha$ of the functional $$J_\alpha(x) := |A x - y|^2 + \alpha |x|^2.$$ For any $\alpha > 0$, this functional is strictly convex and has a unique minimizer $x_\alpha$; among all $x$ producing the same residual magnitude $|A x - y|$, it is the one of smallest norm. Setting the gradient with respect to $x$ to zero, $$A^\top (A x_\alpha - y) + \alpha x_\alpha = 0,$$ gives $x_\alpha = (A^\top A + \alpha I)^{-1} A^\top y$. To verify the gradient computation, write the norms $|\cdot|^2$ in terms of the scalar product $\langle \cdot, \cdot \rangle$.
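A quick numerical sketch of the convergence $x_\alpha \to A^+ y$ as $\alpha \searrow 0$, using NumPy and illustrative random data:

```python
import numpy as np

# Sketch (illustrative data, not from the book): the regularized minimizer
# x_alpha = (A^T A + alpha I)^{-1} A^T y approaches A+ y as alpha -> 0.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
y = rng.standard_normal(5)

x_pinv = np.linalg.pinv(A) @ y
gaps = []
for alpha in (1e-1, 1e-4, 1e-8):
    x_alpha = np.linalg.solve(A.T @ A + alpha * np.eye(3), A.T @ y)
    gaps.append(np.linalg.norm(x_alpha - x_pinv))
print(gaps)  # the gap shrinks as alpha decreases
```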

Ad 1. Suppose $A x = y$ has a solution $x^*$. The set of solutions is the affine (hence convex) set $x^* + \ker A$, so there is exactly one solution of minimal norm: the orthogonal projection of $0$ onto that set. As $\alpha \searrow 0$, the penalty $\alpha |x|^2$ vanishes and the residual term dominates, so $A x = y$ is eventually enforced. Therefore, $x_0 := \lim_{\alpha \searrow 0} x_\alpha$ is the minimal-norm solution of $A x = y$.

Ad 2. If $A x = y$ has no solution, the residual $|A x - y|$ still attains a minimum, and the limit $\alpha \searrow 0$ selects a minimizer of it (the one of smallest norm).
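The no-solution case can also be checked numerically; a sketch with illustrative random data, comparing the residual of $A^+ y$ against random competitors:

```python
import numpy as np

# Sketch (illustrative data): for a tall A, y is generically outside the
# range of A, and x = A+ y minimizes the residual ||Ax - y||_2.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))   # more rows than columns
y = rng.standard_normal(6)

x = np.linalg.pinv(A) @ y
best = np.linalg.norm(A @ x - y)

# Random perturbations of x never achieve a smaller residual.
ok = all(np.linalg.norm(A @ (x + rng.standard_normal(3)) - y) >= best
         for _ in range(100))
print(ok)
```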

---

Let $x$ be $A^+y$.

  1. Let me begin with the second point. For all $z$, we have: \begin{align} \lVert Az-y \rVert_2^2 &= \lVert Ax-y \rVert_2^2 + \lVert A(z-x) \rVert_2^2 + 2 (z-x)^TA^T(Ax-y)\\ & \geq \lVert Ax-y \rVert_2^2 + 2 (z-x)^TA^T(Ax-y), \end{align} since $\lVert A(z-x) \rVert_2^2 \geq 0$. Moreover, because $AA^+A = A$ and $(AA^+)^T = AA^+$, $$ A^T(Ax-y) = (AA^+A)^Ty - A^Ty = A^Ty - A^Ty = 0.$$ Thus, for all $z$, $\lVert Az-y \rVert_2^2 \geq \lVert Ax-y \rVert_2^2$; that is to say, $x = A^+y$ makes $Ax$ as close as possible to $y$ in terms of the Euclidean norm $\lVert Ax-y\rVert_2$.

  2. Now, suppose that there exists $z$ such that $Az=y$. By the first point, $\lVert Ax-y\rVert_2=0$, so $x$ is a solution. Moreover, for every solution $z$, $$ \lVert z \rVert_2^2=\lVert x \rVert_2^2 + \lVert z-x \rVert_2^2 + 2x^T(z-x).$$ Yet, because $A^+Ax=A^+AA^+y=A^+y=x$ and $(A^+A)^T=A^+A$, $$x^T(x-z) = (A^+Ax)^T(x-z) = x^TA^+A(x-z) = x^T(A^+Ax-A^+Az) = x^T(x-A^+y)=0.$$ Thus, $\lVert z \rVert_2^2 \geq \lVert x \rVert_2^2$; that is to say, $x$ is the solution with minimal Euclidean norm.
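A numerical sketch of this minimal-norm property (NumPy, illustrative random data): every exact solution of a wide system differs from $A^+y$ by a null-space vector, and none has a smaller norm.

```python
import numpy as np

# Sketch (illustrative data): for a wide A, every exact solution has the form
# x + n with n in the null space of A, and x = A+ y has the smallest norm.
rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))   # full row rank with probability 1
y = rng.standard_normal(3)

x = np.linalg.pinv(A) @ y
print(np.allclose(A @ x, y))      # the system is exactly solvable

# The last n - k rows of V^T in the SVD span the null space of A.
_, _, Vt = np.linalg.svd(A)
N = Vt[3:]                        # here k = rank(A) = 3, so N is 2 x 5
for _ in range(100):
    z = x + N.T @ rng.standard_normal(2)   # another exact solution
    assert np.allclose(A @ z, y)
    assert np.linalg.norm(z) >= np.linalg.norm(x)
print("x = A+ y has minimal norm among the sampled solutions")
```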

---

The answer to your first question follows easily by writing down the SVDs of $A$ and $A^+$. When $A_{m\times n}$ has more columns than rows ($n>m$), it can be written as $$A=UDV^T,$$ where $U_{m\times m}$ and $V_{n\times n}$ are orthogonal and $D_{m\times n}$ is diagonal. The Moore-Penrose pseudoinverse can be defined as $$A^+=VD^+U^T,$$ where $D^+_{n\times m}$ is such that $$D^+D=\begin{bmatrix}I_{k\times k}&0_{k\times (n-k)}\\0_{(n-k)\times k}&0_{(n-k)\times (n-k)}\end{bmatrix},$$ where $k\le m$ is the number of nonzero singular values of $A$. For example, if $D=\begin{bmatrix}2&0&0&0\\0&3&0&0\\0&0&0&0\end{bmatrix}$, then $k=2$ and $D^+=\begin{bmatrix}{1\over2}&0&0\\0&{1\over3}&0\\0&0&0\\0&0&0\end{bmatrix}$, which gives $D^+D=\begin{bmatrix}1&0&0&0\\0&1&0&0\\0&0&0&0\\0&0&0&0\end{bmatrix}$.

Left-multiplying $Ax=y$ by $D^+U^T$, the linear system reduces to $$D^+DV^Tx=D^+U^Ty.$$ Since $||V^Tx||_2=||x||_2$ (rotation is an isometry), by defining $w\triangleq V^Tx$ we can write $$D^+Dw=D^+U^Ty,$$ which constrains only the first $k$ entries of $w$ (only the first $k$ rows of $D^+D$ are nonzero); the remaining $n-k$ entries of $w$ are free. Choosing them all equal to zero makes $w$ (and hence $x$) attain the least possible $2$-norm, since $||x||_2^2=\sum_{i=1}^{n}|x_i|^2$.
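The $D^+$ construction from the example can be reproduced in a few lines; a NumPy sketch using the same $D$ as above:

```python
import numpy as np

# Sketch of the D+ construction from the example above: take the reciprocal
# of the nonzero entries of D and transpose the shape.
D = np.array([[2., 0., 0., 0.],
              [0., 3., 0., 0.],
              [0., 0., 0., 0.]])

D_plus = D.T.copy()
nonzero = D_plus != 0
D_plus[nonzero] = 1.0 / D_plus[nonzero]

# D+ D is the identity on the first k = 2 coordinates, zero elsewhere.
print(D_plus @ D)
```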

To Be Updated...