Why is the Operator Norm so hard to calculate?


I recently took a better look at the operator norm of a matrix $\mathbf A \in \Bbb{K}^{n\times n}$, defined as follows:

$$ \|\mathbf A\|_p=\sup\{\|\mathbf Ax\|_p \mid x\in\Bbb{K}^n\land\|x\|_p=1\} $$

The first time I looked at this I thought "ok, let's calculate it for a few example matrices". I started with $n = 3$ and $p = 2$, just to start "simple". Let $$ \mathbf A = \left[\begin{matrix}a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33}\end{matrix}\right]\quad a_{ij}\in\Bbb{K} $$ Now if we're going to maximize $\|\mathbf Ax\|_2$ ($ = \|\mathbf Ax\|$), we might as well make it easy on ourselves and maximize $\|\mathbf Ax\|^2$ instead, so as to not worry about that annoying radical. We get $$ \begin{align} \|\mathbf Ax\|^2 & = (a_{11}x_1 + a_{12}x_2 + a_{13}x_3)^2 + (a_{21}x_1 + a_{22}x_2 + a_{23}x_3)^2 + (a_{31}x_1 + a_{32}x_2 + a_{33}x_3)^2 \\ & = Ax_1^2 + Bx_2^2 + Cx_3^2 + Dx_1x_2 + Ex_1x_3 + Fx_2x_3 \end{align} $$ where $$ \begin{align} A & = a_{11}^2+a_{21}^2+a_{31}^2 \\ B & = a_{12}^2+a_{22}^2+a_{32}^2 \\ C & = a_{13}^2+a_{23}^2+a_{33}^2 \\ D & = 2(a_{11}a_{12} + a_{21}a_{22} + a_{31}a_{32}) \\ E & = 2(a_{11}a_{13} + a_{21}a_{23} + a_{31}a_{33}) \\ F & = 2(a_{12}a_{13} + a_{22}a_{23} + a_{32}a_{33}) \end{align} $$ Now let's define $$ G(x_1,\ x_2,\ x_3) = Ax_1^2+Bx_2^2+Cx_3^2+Dx_1x_2+Ex_1x_3+Fx_2x_3 $$ So if we want to maximize $\|\mathbf Ax\|^2$, we're either going to have to maximize $$ N(x_1,\ x_2,\ x_3) = \frac{G(x_1,\ x_2,\ x_3)}{x_1^2+x_2^2+x_3^2} $$ or simply maximize $G$ subject to the constraint $g(x_1,\ x_2,\ x_3) = x_1^2 + x_2^2 + x_3^2 = 1$. The latter seemed easier to me, so I gave it a shot using Lagrange multipliers.

As usual, I defined $$ \mathcal{L}(x_1,\ x_2,\ x_3,\ \lambda) = G(x_1,\ x_2,\ x_3)-\lambda g(x_1,\ x_2,\ x_3) $$ Setting its gradient to zero gives $$ \nabla \mathcal L = 0 \implies \begin{cases} 2(A - \lambda)x_1 + Dx_2 + Ex_3 & = 0 \\ Dx_1 + 2(B - \lambda)x_2 + Fx_3 & = 0 \\ Ex_1 + Fx_2 + 2(C - \lambda)x_3 & = 0 \\ x_1^2 + x_2^2 + x_3^2 - 1 & = 0 \end{cases} $$ Now this is where I really started to get stuck. I tried solving the first three equations for $x_1,\ x_2,$ and $x_3$ but didn't end up with anything I could use. I tried solving for $x_1$ in terms of $x_2,\ x_3,$ and $\lambda$, then $x_2$ in terms of $x_3$ and $\lambda$, and then substituting that all into the third equation, but ended up with either $x_3 = 0$ or $$ 4\lambda^3 - 4\lambda^2(A+B+C) + \lambda(4AB+4AC+4BC-D^2-E^2-F^2)-4ABC+AF^2+BE^2+CD^2-DEF = 0 $$ which, although technically solvable for $\lambda$ via the cubic formula, would be incredibly messy.
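Incidentally, since $G(x) = x^\top(\mathbf A^\top\mathbf A)x$, the stationarity system above is (after dividing by $2$) exactly the eigenvalue problem for $\mathbf A^\top\mathbf A$, so the roots of the cubic are its eigenvalues. A numerical sanity check of this (a sketch with NumPy; the matrix entries are an arbitrary example, not anything from the question):

```python
import numpy as np

# An arbitrary example matrix (illustrative only).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 1.0]])

# G(x) = x^T (A^T A) x, so the Lagrange stationarity system is the
# eigenvalue problem for M = A^T A: the lambdas are M's eigenvalues.
M = A.T @ A
eigvals = np.linalg.eigvalsh(M)      # eigenvalues of the symmetric matrix M
op_norm = np.sqrt(eigvals.max())     # ||A||_2 = sqrt(largest eigenvalue of M)

print(op_norm, np.linalg.norm(A, 2))  # the two values agree
```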

Now, I probably created my own roadblock for this problem, because I didn't want to think about the system of equations logically and just wanted to bash it out. Regardless of my approach, it seems like the operator norm is a very difficult thing to calculate, and I only analyzed the case where $n = 3$ and $p = 2$. What about the general case? What if $n = 75$ and $p = 9/4$? How on earth would you calculate it then?

The questions above are rhetorical, however, and my actual question is as follows:

Why define such an ordinary norm for matrices which is so difficult to calculate in general?

I see the operator norm everywhere, and it seems like the standard norm for a lot of theorems (unless I'm mistaken and $\|A\|$ just means any matrix norm). So why would we define such a standard norm in a way that is so difficult to calculate? What's the point? Is it easy to work with in theorems? I get that it intuitively makes sense as a norm, but it can't possibly be that easy to work with, especially in comparison to things like the Frobenius norm.

So why do we care about this definition?


There are 6 answers below.

Accepted Answer

First, as others have mentioned, the operator norm has many nice properties that make it convenient to use in proofs (most basically the fact that, by definition, it satisfies $\| Ax \| \le \| A \| \|x \|$). You might, for example, end up with factors of the operator norm in various bounds; even if you can't calculate the operator norm, if you can upper or lower bound it as appropriate then you can still extract information from these bounds. To really see the operator norm in action you can try learning some functional analysis; it really starts to be useful in the infinite-dimensional setting.

Second, here's how you calculate the operator norm (edit: when $p=2$). Let me assume that $A$ is real for simplicity although it doesn't matter much. You want to maximize $\langle Ax, Ax \rangle$ as $x$ ranges over all unit vectors. This is equivalent to maximizing

$$\langle A^T A x, x \rangle.$$

Now, unlike $A$, the matrix $A^T A$ is symmetric, and so by the spectral theorem it has an orthonormal basis of eigenvectors. These are the right singular vectors $r_i$ of $A$, and the corresponding eigenvalues are the squares $\sigma_i^2$ of the singular values of $A$ (up to the appearance of some zeroes, which don't matter for this calculation). If we write $x$ in this basis as

$$x = \sum x_i r_i$$

we get that

$$\langle Ax, Ax \rangle = \sum \sigma_i^2 x_i^2$$

where $\langle x, x \rangle = \sum x_i^2 = 1$. This is a much easier optimization problem! It follows that $\langle Ax, Ax \rangle$ is maximized when $x$ is equal to a right singular vector corresponding to the largest singular value $\sigma_1$, and that its maximum value is $\sigma_1^2$. Hence $\sigma_1$ is the operator norm of $A$. Note that if $A$ is normal it coincides with the absolute value of the largest eigenvalue (in absolute value) of $A$.

The largest singular value can be calculated in various ways. See the Wikipedia article on singular value decomposition for details.
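As a concrete illustration of the recipe above (a sketch using NumPy; the matrix is an arbitrary example):

```python
import numpy as np

# Arbitrary illustrative matrix.
A = np.array([[2.0, -1.0],
              [1.0,  3.0]])

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, descending order
op_norm = sigma[0]                          # ||A||_2 is the largest singular value

print(op_norm, np.linalg.norm(A, 2))  # NumPy's spectral norm agrees
```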

Answer 2

The point is: in pure mathematics you mostly don't care about actually calculating things ;)

Of course this is (partly) a joke, but the answer is really that this norm has a lot of nice properties (see functional analysis), and it allows you to prove theorems that make other calculations a lot easier. The trouble is: if you want to define things so that they actually work in theorems (satisfy some nice universal property and so on), they are often quite difficult (if not impossible) to compute in practice. But computing is not all you want, if theorems manage to make everything a lot clearer.

Answer 3

An operator norm is better than just a norm: it is an algebra norm (I'm not sure that is the right term in English, but it is in French). The point is that the norm is submultiplicative: $$\forall A,B,\quad\|AB\|\leqslant\|A\|\times\|B\|.$$
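A quick numerical check of submultiplicativity for the spectral norm (a sketch with NumPy; the random matrices are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

lhs = np.linalg.norm(A @ B, 2)                     # ||AB||
rhs = np.linalg.norm(A, 2) * np.linalg.norm(B, 2)  # ||A|| * ||B||

print(lhs <= rhs)  # submultiplicativity: ||AB|| <= ||A|| ||B||
```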

Answer 4

The operator norm of a square matrix $A$ is the square root of the magnitude of the largest eigenvalue of $A^T A$. To see this, first note that $$\|A^T A\| = \sup\limits_{\|x\|=1} \| A^T A x \| = \sup\limits_{\|x\|=1} \sup\limits_{\|y\|=1} \langle A^T A x , y \rangle = \sup\limits_{\|x\|=1} \sup\limits_{\|y\|=1} \langle Ax , Ay \rangle$$ $$= \sup\limits_{\|x\|=1} \|Ax\|^2.$$ Thus $\|A^T A\| = \|A\|^2$. Now, using the fact that a symmetric matrix has an orthonormal basis of eigenvectors, it's not hard to show that for a symmetric matrix $S$, $\|S\|$ is the (absolute value of) the largest eigenvalue of $S$. Since $A^T A$ is symmetric, this completes the proof.
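The identity $\|A^T A\| = \|A\|^2$ is easy to check numerically (a sketch with NumPy; the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [2.0, 0.0]])  # arbitrary illustrative matrix

lhs = np.linalg.norm(A.T @ A, 2)  # operator norm of the symmetric matrix A^T A
rhs = np.linalg.norm(A, 2) ** 2   # square of the operator norm of A

print(lhs, rhs)  # the two values agree up to rounding
```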

Answer 5

If you switch your vector norm to the 1-norm or the $\infty$-norm, then it turns out that the matrix norm is quite easy to compute. Since convergence in any of these norms implies convergence in the others, we frequently switch to the norm that is easiest to use for a particular problem.
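These two cases reduce to the maximum absolute column sum and maximum absolute row sum, respectively (a sketch with NumPy; the matrix is an arbitrary example):

```python
import numpy as np

A = np.array([[ 1.0, -2.0,  3.0],
              [ 0.0,  4.0, -1.0],
              [ 2.0,  1.0,  0.0]])  # arbitrary illustrative matrix

norm_1   = np.abs(A).sum(axis=0).max()  # ||A||_1:   maximum absolute column sum
norm_inf = np.abs(A).sum(axis=1).max()  # ||A||_inf: maximum absolute row sum

print(norm_1, np.linalg.norm(A, 1))         # both give 7.0
print(norm_inf, np.linalg.norm(A, np.inf))  # both give 6.0
```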

Answer 6

It depends on the norm which you take to begin with. Some matrix norms are hard to compute, others are not. In your example, for $p=2$, the norm of the matrix $A\in \mathbb{K}^{n\times n}$ is the square root of the maximal eigenvalue of $A^* A$. This computation is not too hard, even in large dimensions, as this is a Hermitian (resp. symmetric) eigenvalue problem.

Similarly, for $p=1$ and $p=\infty$ the matrix norm has simple expressions, as the column sum, resp. row sum norm.

The technical reason why operator norms are great has been pointed out in previous answers. Submultiplicativity is very handy for many types of estimates. For instance, you get that $\|A\| \geq r(A)$, where $r(A)$ is the spectral radius of $A$. And on top of that you get the famous Gelfand formula $$ r(A) = \lim_{k\to \infty} \|A^k\|^{1/k} = \inf_{k\geq 1} \|A^k\|^{1/k},$$ which even holds for bounded linear operators on Banach spaces.
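Gelfand's formula can be observed numerically (a sketch with NumPy; the matrix is an arbitrary triangular example whose spectral radius is $0.6$):

```python
import numpy as np

A = np.array([[0.5, 1.0],
              [0.0, 0.6]])  # triangular, so its eigenvalues 0.5 and 0.6 are on the diagonal

r = max(abs(np.linalg.eigvals(A)))  # spectral radius r(A) = 0.6

# ||A^k||^{1/k} approaches r(A) as k grows (Gelfand's formula),
# even though ||A||_2 itself is well above r(A) here.
for k in (1, 10, 100):
    print(k, np.linalg.norm(np.linalg.matrix_power(A, k), 2) ** (1.0 / k))
print(r)
```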