How to calculate the gradient of $x^T A x$?


I am watching the following video lecture:

https://www.youtube.com/watch?v=G_p4QJrjdOw

In it, the lecturer calculates the gradient of $x^{T}Ax$ using the concept of the exterior derivative. The proof goes as follows:

  1. $ y = x^{T}Ax$
  2. $ dy = (dx)^{T}Ax + x^{T}A\,dx = x^{T}(A+A^{T})\,dx$ (using the trace property of matrices)
  3. $ dy = (\nabla y)^{T} dx $ and because the rule is true for all $dx$
  4. $ \nabla y = x^{T}(A+A^{T})$

It seems that in Step 2 some form of product rule for differentials is applied. I am familiar with the product rule from single-variable calculus, but I do not understand how it was applied to a multivariate function expressed in matrix form.

It would be great if somebody could point me to a mathematical theorem that justifies Step 2 in the above proof.

Thanks! Ajay

5 Answers

BEST ANSWER

\begin{align*} dy & = d(x^{T}Ax) = d(Ax\cdot x) = d\left(\sum_{i=1}^{n}(Ax)_{i}x_{i}\right) \\ & = d \left(\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i,j}x_{j}x_{i}\right) =\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i,j}x_{i}\,dx_{j}+\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i,j}x_{j}\,dx_{i} \\ & =\sum_{i=1}^{n}(Ax)_{i}\,dx_{i}+\sum_{i=1}^{n}(A\,dx)_{i}\,x_{i} =(dx)^{T}Ax+x^{T}A\,dx \\ & =(dx)^{T}Ax+(dx)^{T}A^{T}x =(dx)^{T}(A+A^{T})x. \end{align*}

Since a $1\times1$ matrix equals its own transpose, this is the same as $x^{T}(A+A^{T})\,dx$, matching Step 2 of the question.
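To double-check the identity numerically, here is a minimal sketch with NumPy (the random matrix and point are my own choices, not part of the answer). A non-symmetric $A$ is deliberately used so that the $A^{T}$ term actually matters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # generic, deliberately non-symmetric
x = rng.standard_normal(n)

y = lambda v: v @ A @ v           # y(v) = v^T A v

# central finite differences along each coordinate direction
eps = 1e-6
grad_fd = np.array([(y(x + eps*e) - y(x - eps*e)) / (2*eps) for e in np.eye(n)])

print(np.allclose(grad_fd, (A + A.T) @ x, atol=1e-6))  # True
```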

ANSWER

Step 2 might be the result of a simple computation. Consider $u(x)=x^TAx$, then $$ u(x+h)=(x+h)^TA(x+h)=x^TAx+h^TAx+x^TAh+h^TAh, $$ that is, $u(x+h)=u(x)+x^T(A+A^T)h+r_x(h)$ where $r_x(h)=h^TAh$ (this uses the fact that $h^TAx=x^TA^Th$, which holds because $m=h^TAx$ is a $1\times1$ matrix hence $m^T=m$).

One sees that $r_x(h)=o(\|h\|)$ when $h\to0$. This proves that the differential of $u$ at $x$ is the linear map $Du(x):\mathbb R^n\to\mathbb R$, $h\mapsto x^T(A+A^T)h$, which can be identified with the unique vector $z$ such that $Du(x)(h)=z^Th$ for every $h$ in $\mathbb R^n$, namely the gradient $z=\nabla u(x)=(A+A^T)x$.
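If it helps to see the $o(\|h\|)$ claim concretely, here is a small numerical sketch (the matrix, point, and direction are arbitrary choices of mine): the ratio $|r_x(h)|/\|h\|$ shrinks in proportion to $\|h\|$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
u = lambda v: v @ A @ v

d = rng.standard_normal(n)                    # a fixed direction
for t in (1e-1, 1e-2, 1e-3, 1e-4):
    h = t * d
    r = u(x + h) - u(x) - x @ (A + A.T) @ h   # remainder r_x(h) = h^T A h
    print(f"{t:.0e}  {abs(r) / np.linalg.norm(h):.3e}")  # ratio ~ const * t
```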

ANSWER

Here's a method that calculates the gradient of $x^TAx$ without using the exterior derivative. I know this is not what you are after, but the direct computation is worth seeing, if only for comparison: it shows how much easier the exterior-derivative method is.

Let $A$ be $n\times n$, $A = [a_{ij}]$. If $x \in \mathbb{R}^n$, $x = (x_1, \dots, x_n)^T$, then $y = \displaystyle\sum_{i=1}^n\sum_{j=1}^na_{ij}x_ix_j$.

Then we have

\begin{align*} \dfrac{\partial y}{\partial x_k} &= \sum_{i\neq k}\dfrac{\partial}{\partial x_k}\left(\sum_{j=1}^na_{ij}x_ix_j\right) + \dfrac{\partial}{\partial x_k}\left(\sum_{j=1}^na_{kj}x_kx_j\right)\\ &=\sum_{i\neq k}\left(\dfrac{\partial}{\partial x_k}\left(\sum_{j\neq k}a_{ij}x_ix_j\right) + \dfrac{\partial}{\partial x_k}(a_{ik}x_ix_k)\right) + \sum_{j\neq k}\dfrac{\partial}{\partial x_k}(a_{kj}x_kx_j) + \dfrac{\partial}{\partial x_k}(a_{kk}x_k^2)\\ &= \sum_{i\neq k}a_{ik}x_i + \sum_{j\neq k}a_{kj}x_j + 2a_{kk}x_k\\ &= \sum_{i = 1}^na_{ik}x_i + \sum_{j=1}^na_{kj}x_j\\ &= (x^TA)_k + (Ax)_k \end{align*}

where $(x^TA)_k$ is the $k^{\text{th}}$ component of the row vector $x^TA$ and $(Ax)_k$ is the $k^{\text{th}}$ component of the column vector $Ax$. Taking the transpose of $Ax$ gives the row vector $x^TA^T$, whose $k^{\text{th}}$ component is the same as that of $Ax$, so $\dfrac{\partial y}{\partial x_k} = (x^TA)_k + (x^TA^T)_k$. Therefore

$$\nabla y = x^TA + x^TA^T = x^T(A + A^T).$$
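The componentwise formula $\dfrac{\partial y}{\partial x_k} = (x^TA)_k + (Ax)_k$ can also be confirmed symbolically, e.g. with SymPy (a sketch; the symbol names are mine):

```python
import sympy as sp

n = 3
A = sp.Matrix(n, n, lambda i, j: sp.Symbol(f"a{i}{j}"))
x = sp.Matrix(n, 1, lambda i, _: sp.Symbol(f"x{i}"))
y = (x.T * A * x)[0, 0]

# compare dy/dx_k with (x^T A)_k + (A x)_k for every k
for k in range(n):
    lhs = sp.diff(y, x[k])
    rhs = (x.T * A)[k] + (A * x)[k]
    print(sp.expand(lhs - rhs) == 0)   # True for each k
```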

ANSWER

Another approach is to use a multivariable product rule. Suppose $g$ and $h$ are differentiable functions from $\mathbb R^n$ to $\mathbb R^m$, and $f(x) = \langle g(x), h(x) \rangle$ for all $x \in \mathbb R^n$. Then if $\Delta x \in \mathbb R^n$ is small we have \begin{align} f(x+\Delta x) &= \langle g(x+\Delta x), h(x+\Delta x) \rangle\\ &\approx \langle g(x) + g'(x) \Delta x, h(x) + h'(x) \Delta x \rangle \\ &= \langle g(x), h(x) \rangle + \langle h(x), g'(x) \Delta x \rangle + \langle g(x), h'(x) \Delta x \rangle + \text{small term} \\ &\approx f(x) + \langle g'(x)^T h(x),\Delta x \rangle + \langle h'(x)^T g(x), \Delta x \rangle \\ &= f(x) + \langle g'(x)^T h(x) + h'(x)^T g(x),\Delta x \rangle. \end{align} This suggests that \begin{equation} \nabla f(x) = g'(x)^T h(x) + h'(x)^T g(x). \end{equation} This is our multivariable product rule. (This derivation could be made into a rigorous proof by keeping track of error terms.)

In the case where $g(x) = x$ and $h(x) = Ax$, we see that \begin{align} \nabla f(x) &= Ax + A^T x \\ &= (A + A^T) x. \end{align}
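As a sanity check of the product rule itself (not just this special case), here is a sketch with a deliberately nonlinear choice $g(x)=\sin(x)$ taken elementwise and $h(x)=Ax$; these example functions are my own, not from the answer:

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 4, 1e-6
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

g = lambda v: np.sin(v)        # g : R^n -> R^n (elementwise sine)
h = lambda v: A @ v            # h : R^n -> R^n
f = lambda v: g(v) @ h(v)      # f(v) = <g(v), h(v)>

# product rule: grad f(x) = g'(x)^T h(x) + h'(x)^T g(x),
# with g'(x) = diag(cos(x)) and h'(x) = A
grad_rule = np.diag(np.cos(x)).T @ h(x) + A.T @ g(x)

grad_fd = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(n)])
print(np.allclose(grad_rule, grad_fd, atol=1e-5))  # True
```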

(Edit) Explanation of notation:

Let $f:\mathbb R^n \to \mathbb R^m$ be differentiable at $x \in \mathbb R^n$. Then $f'(x)$ is the $m \times n$ matrix defined informally by \begin{equation} f(\underbrace{x}_{n \times 1} + \underbrace{\Delta x}_{n \times 1}) \approx \underbrace{f(x)}_{m \times 1} + \underbrace{f'(x)}_{m \times n} \underbrace{\Delta x}_{n \times 1}. \end{equation} (The approximation is good when $\Delta x$ is small.)

When $f:\mathbb R^n \to \mathbb R$, $f'(x)$ is a $1 \times n$ row vector. The gradient of $f$ at $x$, defined by $\nabla f(x) = f'(x)^T$, is an $n \times 1$ column vector. Notice that, in this case, \begin{equation} f(x + \Delta x) \approx f(x) + \underbrace{f'(x)}_{1 \times n} \underbrace{\Delta x}_{n \times 1} = f(x) + \langle \nabla f(x), \Delta x \rangle. \end{equation}
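For concreteness, this definition can be turned into a finite-difference approximation of $f'(x)$; the helper `jacobian_fd` below is a name introduced here purely for illustration:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference m x n Jacobian of f at x, per the definition above."""
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = 1.0
        J[:, j] = (np.atleast_1d(f(x + eps*e)) - np.atleast_1d(f(x - eps*e))) / (2*eps)
    return J

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([0.5, -1.0])
print(jacobian_fd(lambda v: A @ v, x))       # recovers A, since (Av)' = A
print(jacobian_fd(lambda v: v @ A @ v, x))   # 1 x n row vector x^T (A + A^T)
```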

ANSWER

The exterior derivative has nothing to do with it here; how could a student be expected to understand such a proof? "Did" gave a good answer.

The gradient $\nabla(f)$ of a function $f: E\rightarrow \mathbb{R}$ is defined, relative to a dot product $\langle \cdot, \cdot \rangle$ on the vector space $E$, by the formula $$ \langle \nabla(f)(x), h \rangle = Df_x(h), $$ where $Df_x$ is the derivative of $f$ at $x$.

Example 1: Let $f:x\in\mathbb{R}^n \rightarrow x^TAx\in\mathbb{R}$. Then $Df_x(h)=h^TAx+x^TAh=x^T(A+A^T)h$ (it is the derivative of a non-commutative product!); here we take the dot product $u\cdot v=u^Tv$.

Thus, $Df_x(h)= \langle ((A+A^T)x), h \rangle $ and $\nabla(f)(x)=(A+A^T)x$, that is $\nabla(f)=A+A^T$.

Example 2: Let $f:X\in\mathcal{M}_n(\mathbb{R}) \rightarrow \operatorname{Trace}(X^TAX)\in\mathbb{R}$, where $\mathcal{M}_n(\mathbb{R})$ is the set of all $n \times n$ matrices over $\mathbb{R}$.

Since Trace is a linear function, we have $$ Df_X(H) = \operatorname{Trace}(H^TAX+X^TAH) = \operatorname{Trace}(X^T(A+A^T)H) $$ (using $\operatorname{Trace}(M)=\operatorname{Trace}(M^T)$ on the first term); here we take the dot product $\langle U,V \rangle = \operatorname{Trace}(U^TV)$.

Thus, $Df_X(H)=\langle ((A+A^T)X), H \rangle$ and $\nabla(f)(X)=(A+A^T)X$, that is $\nabla(f)=(A+A^T)\otimes I$. (Kronecker product).
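A numerical sketch of Example 2 (random $A$ and $X$ of my choosing), building the gradient matrix from entrywise central differences:

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps = 3, 1e-6
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
f = lambda M: np.trace(M.T @ A @ M)

# assemble the gradient matrix entry by entry
grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        grad_fd[i, j] = (f(X + eps*E) - f(X - eps*E)) / (2*eps)

print(np.allclose(grad_fd, (A + A.T) @ X, atol=1e-6))  # True
```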

Example 3 (more difficult): Let $f:X\in\mathcal{M}_n(\mathbb{R}) \rightarrow \det(X)\in\mathbb{R}$.

Then we have $$ Df_X(H) = \operatorname{Trace}(\operatorname{adj}(X)H) = \langle \operatorname{adj}(X)^T, H \rangle \quad \text{and} \quad \nabla(f)(X) = \operatorname{adj}(X)^T, $$ where $\operatorname{adj}(X)$ is the adjugate (classical adjoint) of $X$.
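Example 3 can be checked the same way, using the identity $\operatorname{adj}(X)=\det(X)\,X^{-1}$ for invertible $X$ (an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps = 3, 1e-6
X = rng.standard_normal((n, n))      # almost surely invertible

grad_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        grad_fd[i, j] = (np.linalg.det(X + eps*E) - np.linalg.det(X - eps*E)) / (2*eps)

adjX = np.linalg.det(X) * np.linalg.inv(X)   # adj(X) = det(X) X^{-1} when X is invertible
print(np.allclose(grad_fd, adjX.T, atol=1e-6))  # True
```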

Example 4: Let $f:X\in\mathcal{M}_n(\mathbb{R}) \rightarrow X^TAX\in\mathcal{M}_n(\mathbb{R})$.

Then we have $Df_X(H)=H^TAX+X^TAH$. Here the gradient of $f$ does not exist, since $f$ is not scalar-valued. In a pinch, we can define $n^2$ gradients $\nabla(f_{i,j})$ (componentwise), but these functions have no geometric meaning.