What is the intuitive interpretation of the transpose compared to the inverse?


I've been thinking about this question for a long time, and I've just encountered it again in the following lemma:

$$f(x) = g(Ax + b) \implies \nabla f(x) = A^T \nabla g(Ax + b) $$

This lemma makes intuitive sense if you think of it as mapping $x$ to $Ax + b$, computing the gradient there, and then taking the result back to the original space. But why is "taking the result back" realised as $A^T$ and not $A^{-1}$?

By doing the calculation you get $A^T$, no doubt about it, but I always expect an inverse. In general, when should I expect a transpose and when an inverse? Where are they similar, and where do they differ?
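To convince myself numerically, here is a quick sanity check (Python/NumPy; the smooth test function $g(y)=\sum_i \sin y_i$ and the random square $A$ and $b$ are arbitrary illustrative choices, not part of the lemma):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # square, so A is (almost surely) invertible
b = rng.standard_normal(n)
x = rng.standard_normal(n)

# arbitrary smooth test function with a known gradient: g(y) = sum(sin(y)), grad g(y) = cos(y)
g = lambda y: np.sum(np.sin(y))
grad_g = lambda y: np.cos(y)
f = lambda x: g(A @ x + b)

# central-difference approximation of grad f
eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(fd_grad, A.T @ grad_g(A @ x + b)))               # True
print(np.allclose(fd_grad, np.linalg.inv(A) @ grad_g(A @ x + b)))  # False in general
```

So the transpose really is what shows up, even when $A^{-1}$ exists.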

5 Answers

BEST ANSWER

We usually see matrices as linear transformations. The inverse of $A$, when it exists, means simply "reversing" what $A$ does as a function. The transpose originates in a different point of view.

So we have vector spaces $X,Y$, and $A:X\to Y$ is linear. For many reasons, we often look at the linear functionals on the space; that way we get the dual space $$ X^*=\{f:X\to\mathbb R:\ f\ \text{ is linear}\}, $$ and correspondingly $Y^*$. Now the map $A$ induces a natural map $A^*:Y^*\to X^*$ by $$ (A^*g)(x)=g(Ax). $$ In the particular case where $X=\mathbb R^n$, $Y=\mathbb R^m$, one can check that $X^*\cong X$ and $Y^*\cong Y$, in the sense that every linear functional $f:\mathbb R^n\to\mathbb R$ is of the form $f(x)=y^Tx$ for some fixed $y\in\mathbb R^n$. In this situation $A$ is an $m\times n$ matrix, and the matrix of $A^*$ is the transpose of $A$.
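To see this concretely in coordinates, here is a small NumPy sketch (the random $A$ and the vector $c$ representing a functional on $\mathbb R^m$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
A = rng.standard_normal((m, n))   # A : R^n -> R^m
c = rng.standard_normal(m)        # represents the functional g(y) = c^T y on R^m
x = rng.standard_normal(n)

# (A* g)(x) = g(Ax), and the vector representing A* g turns out to be A^T c
lhs = c @ (A @ x)                 # g(Ax)
rhs = (A.T @ c) @ x               # (A^T c)^T x
print(np.isclose(lhs, rhs))       # True: the matrix of A* is A^T
```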

ANSWER

Something weird is going on here. I'm assuming $g: \mathbb R^m \to \mathbb R$ and, say, that $A$ is an $m\times n$ matrix. Let $a: \mathbb R^n \to \mathbb R^m,\ x \mapsto Ax + b$ be the corresponding affine transformation, so that $f = g \circ a$. The chain rule says $Df(x) = Dg(a(x))\,Da(x)$.

The Jacobian realization of $Dg$ is $\nabla g$, a $1\times m$ matrix (row vector), while the Jacobian of $a$ is $A$, an $m \times n$ matrix. The dimensions all agree: this makes $\nabla f$ a $1\times n$ matrix, which fits the notion that the derivative of $f$ is a linear map $\mathbb R^n \to \mathbb R$.

So what I suspect is happening is an identification of $\mathbb R^n$ with its dual space under the Euclidean inner product; that is, you're realizing the gradient as a column vector instead of a row vector. The transpose is precisely how this is done. If $T: V \to W$ is a linear transformation, then its adjoint is $T^\dagger: W^* \to V^*$. But under the Euclidean inner product you can identify $\mathbb R^n \cong (\mathbb R^n)^*$, so $$ (\nabla g(a(x))\, A)^T = A^T [\nabla g(a(x))]^T = A^T \nabla g(a(x)), $$ where we're abusing notation by writing $\nabla g$ both for the row vector and for the corresponding column vector. This hidden identification is likely what is confusing you.
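To make the shape bookkeeping explicit, here is a small NumPy sketch (using the arbitrary test function $g(y)=\sum_i \sin y_i$, whose derivative at $y$ is the row vector with entries $\cos y_i$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

grad_g_row = np.cos(A @ x + b).reshape(1, m)        # Dg(a(x)) as a 1 x m row vector
Df_row = grad_g_row @ A                             # chain rule: (1 x m)(m x n) = 1 x n
grad_f_col = Df_row.T                               # identify the row with a column vector

print(Df_row.shape, grad_f_col.shape)               # (1, 5) (5, 1)
print(np.allclose(grad_f_col, A.T @ grad_g_row.T))  # True: (Dg A)^T = A^T (Dg)^T
```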

ANSWER

Notice, using the chain rule, that the directional derivative of $f(x)=g(Ax+b)$ at $p$ in the direction $v$ is $$D_pf(v)=\langle\nabla g(Ap+b),Av\rangle=\langle A^T\nabla g(Ap+b),v\rangle.$$ Now compare with $D_pf(v)=\langle\nabla f(p),v\rangle$ to read off $\nabla f(p)=A^T\nabla g(Ap+b)$.
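The same computation can be checked symbolically; here is a small SymPy sketch (the concrete $2\times 2$ matrix $A$, the offset $b$, and the test function $g$ are arbitrary choices):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
x = sp.Matrix([x1, x2])
A = sp.Matrix([[1, 2], [3, 4]])
b = sp.Matrix([5, 6])

y = A * x + b                        # y = Ax + b, a symbolic 2-vector
g = sp.sin(y[0]) + sp.cos(y[1])      # f(x) = g(Ax + b) for an arbitrary smooth g
grad_f = sp.Matrix([sp.diff(g, v) for v in (x1, x2)])

grad_g_at_y = sp.Matrix([sp.cos(y[0]), -sp.sin(y[1])])  # grad g evaluated at Ax + b
print(sp.simplify(grad_f - A.T * grad_g_at_y))          # Matrix([[0], [0]])
```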

ANSWER

Here you are not "taking the result back to the original space"; you are chaining transforms.

If you think of a linear transform applied to a vector, it's a bunch of dot products of the rows of the matrix with the column vector, and

$$\vec x\cdot\vec y\equiv x^Ty.$$
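In this language, the defining property of the transpose is that it moves $A$ to the other side of the dot product: $\langle Ax,\, y\rangle = \langle x,\, A^Ty\rangle$. A one-line NumPy check (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

# <Ax, y> = <x, A^T y>: the transpose moves A across the dot product
print(np.isclose((A @ x) @ y, x @ (A.T @ y)))   # True
```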

ANSWER

Taking the directional derivative of $f (\mathrm x) := g (\mathrm A \mathrm x + \mathrm b)$ in the direction of $\rm v$ at $\rm x$,

$$\lim_{h \to 0} \frac{f (\mathrm x + h \mathrm v) - f (\mathrm x)}{h} = \langle \nabla g (\mathrm A \mathrm x + \mathrm b), \mathrm A \mathrm v \rangle = \langle \mathrm A \mathrm v, \nabla g (\mathrm A \mathrm x + \mathrm b) \rangle = \langle \mathrm v, \mathrm A^\top \nabla g (\mathrm A \mathrm x + \mathrm b) \rangle$$

and, thus, the gradient of $f$ is

$$\nabla f (\mathrm x) = \mathrm A^\top \nabla g (\mathrm A \mathrm x + \mathrm b)$$
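A quick numerical check of this directional-derivative identity (Python/NumPy; the test function $g(y)=\sum_i \sin y_i$ and the random $\mathrm A$, $\mathrm b$, $\mathrm x$, $\mathrm v$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
v = rng.standard_normal(n)

g = lambda y: np.sum(np.sin(y))   # arbitrary smooth test function
grad_g = lambda y: np.cos(y)      # its gradient
f = lambda x: g(A @ x + b)

h = 1e-7
dir_deriv = (f(x + h * v) - f(x)) / h   # forward-difference approximation of the limit
print(np.isclose(dir_deriv, v @ (A.T @ grad_g(A @ x + b)), atol=1e-5))  # True
```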