Derivative of $x^T A(x) x$


How would the following derivative be expressed:

$$\frac{d}{d{\mathbf{x}}}\left(\mathbf{x}^T A(\mathbf{x}) \mathbf{x}\right)$$

Here $\mathbf{x}\in \mathbb{R}^n$ is a vector and $A(\mathbf{x})\in \mathbb{R}^{n\times n}$ is a symmetric square matrix, each element of which is a function of $\mathbf{x}$. I am having trouble visualizing the result of this general case.

EDIT

I am going about this very naively following a comment below.

$$\frac{d}{d{\mathbf{x}}}\left(\mathbf{x}^T A(\mathbf{x}) \mathbf{x}\right) = \left(\frac{d}{d\mathbf{x}} \mathbf{x}^T\right)A(\mathbf{x}) \mathbf{x} + \mathbf{x}^T \frac{d}{d\mathbf{x}}(A(\mathbf{x})\mathbf{x}),$$

but then I end up with an inconsistent relationship involving one term that is a scalar:

$$\frac{d}{d{\mathbf{x}}}\left(\mathbf{x}^T A(\mathbf{x}) \mathbf{x}\right) = A(\mathbf{x}) \mathbf{x} + \mathbf{x}^T \frac{d}{d\mathbf{x}}(A(\mathbf{x}))\, \mathbf{x} + \mathbf{x}^T A(\mathbf{x})$$

where the second term on the right-hand side appears to be a scalar.

There are 3 answers below.

ANSWER 1 (ACCEPTED)

You are correct to get to

$$ \begin{aligned}y & =\boldsymbol{x}^{\intercal}{\rm A}\boldsymbol{x}\\ \tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}y & =\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left(\boldsymbol{x}^{\intercal}{\rm A}\boldsymbol{x}\right)\\ & =\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left(\boldsymbol{x}\right)^{\intercal}{\rm A}\boldsymbol{x}+\boldsymbol{x}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)\boldsymbol{x}+\boldsymbol{x}^{\intercal}{\rm A}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left(\boldsymbol{x}\right)\\ & =\left(\boldsymbol{x}^{\intercal}{\rm A}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left(\boldsymbol{x}\right)\right)^{\intercal}+\boldsymbol{x}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)\boldsymbol{x}+\boldsymbol{x}^{\intercal}{\rm A}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left(\boldsymbol{x}\right)\\ & =\left(\boldsymbol{x}^{\intercal}{\rm A}^{\intercal}\right)^{\intercal}+\boldsymbol{x}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)\boldsymbol{x}+\boldsymbol{x}^{\intercal}{\rm A}\\ & ={\rm A}\boldsymbol{x}+\boldsymbol{x}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)\boldsymbol{x}+\boldsymbol{x}^{\intercal}{\rm A} \end{aligned}$$

But you have to recognize that $\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)$ is not a matrix, but a rank-3 tensor. So $\boldsymbol{x}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)\boldsymbol{x}$ is not a scalar, but a vector.

To understand $\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)$, consider the partial derivative of ${\rm A}$ with respect to the $k$-th component of $\boldsymbol{x}$, i.e. the matrix $${\rm J}_k = \frac{\partial{\rm A}}{\partial x_k}, \qquad \left({\rm J}_k\right)_{ij} = \frac{\partial{\rm A}_{ij}}{\partial x_k}.$$

As a result, the $k$-th element of the vector $\boldsymbol{x}^{\intercal}\tfrac{{\rm d}}{{\rm d}\boldsymbol{x}}\left({\rm A}\right)\boldsymbol{x}$ is the scalar $\boldsymbol{x}^{\intercal}\left(\frac{\partial{\rm A}}{\partial x_k}\right)\boldsymbol{x}$.
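As a quick numerical sanity check of this formula, here is a minimal sketch with a hypothetical choice $A(x) = \begin{pmatrix} x_0 & x_1 \\ x_1 & x_0 \end{pmatrix}$ (this matrix, and the test point, are illustrative assumptions, not from the question). It compares the gradient $2A(x)x + v$, with $v_k = x^{\intercal}\left(\partial A/\partial x_k\right)x$, against a central finite difference of $f(x) = x^{\intercal}A(x)x$:

```python
# Illustrative symmetric matrix A(x) = [[x0, x1], [x1, x0]] (n = 2).
def A(x):
    return [[x[0], x[1]], [x[1], x[0]]]

# Partial derivatives of A with respect to x0 and x1, computed by hand.
dA = [
    [[1.0, 0.0], [0.0, 1.0]],  # dA/dx0
    [[0.0, 1.0], [1.0, 0.0]],  # dA/dx1
]

def f(x):
    # f(x) = x^T A(x) x
    return sum(x[i] * A(x)[i][j] * x[j] for i in range(2) for j in range(2))

def grad_formula(x):
    # k-th component: 2 (A(x) x)_k  +  x^T (dA/dx_k) x
    g = []
    for k in range(2):
        two_Ax_k = 2.0 * sum(A(x)[k][j] * x[j] for j in range(2))
        v_k = sum(x[i] * dA[k][i][j] * x[j] for i in range(2) for j in range(2))
        g.append(two_Ax_k + v_k)
    return g

def grad_fd(x, eps=1e-6):
    # central finite differences of f, for comparison
    g = []
    for k in range(2):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        g.append((f(xp) - f(xm)) / (2.0 * eps))
    return g

x = [0.7, -1.3]
print(grad_formula(x))  # agrees with grad_fd(x) up to finite-difference error
print(grad_fd(x))
```

For this particular $A$ one can also expand by hand: $f(x) = x_0^3 + 3x_0x_1^2$, whose gradient $(3x_0^2 + 3x_1^2,\; 6x_0x_1)$ matches what both functions return.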

ANSWER 2

Just apply matrix calculus. It follows from linearity and the symmetry of $A(x)$ that

\begin{align} d(x^\intercal A(x)x) &= dx^\intercal A(x)x + x^\intercal \,dA(x)\,x + x^\intercal A(x)\,dx \\ &= (A(x)x+A(x)^\intercal x)\cdot dx + x^\intercal (A'(x)dx)x \\ &= \left(2A(x)x + xx^\intercal : A'(x)\right)\cdot dx = \nabla\left(x^\intercal A(x)x\right)\cdot dx, \end{align}

which gives the desired derivative automatically through duality.
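The key step above rewrites the quadratic form as a Frobenius inner product, $x^\intercal M x = (xx^\intercal) : M$. A minimal sketch verifying that identity numerically (the matrix and vector below are arbitrary illustrative values):

```python
# Check the identity  x^T M x = (x x^T) : M,  where  P : Q = sum_ij P_ij Q_ij
# is the Frobenius inner product used in the answer above.
def frobenius(P, Q):
    return sum(P[i][j] * Q[i][j]
               for i in range(len(P)) for j in range(len(P[0])))

x = [2.0, -1.0, 0.5]
M = [[1.0, 2.0, 0.0],
     [3.0, -1.0, 4.0],
     [0.5, 1.0, 2.0]]

# quadratic form x^T M x, expanded entrywise
quad = sum(x[i] * M[i][j] * x[j] for i in range(3) for j in range(3))

# outer product x x^T, then contract with M
outer = [[x[i] * x[j] for j in range(3)] for i in range(3)]

print(quad, frobenius(outer, M))  # the two numbers agree
```

This is why the $x^\intercal(A'(x)dx)x$ term can be collapsed onto $dx$ as $xx^\intercal : A'(x)$, exposing the gradient by duality.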

ANSWER 3

1. Fundamentally, for a function $f : \mathcal{X} \to \mathcal{Y}$ between normed spaces $\mathcal{X}$ and $\mathcal{Y}$, its derivative $A$ at $x \in \mathcal{X}$ is the "best linear approximation" of $f$ about $x$. That is, $A$ is a (continuous) linear map from $\mathcal{X}$ to $\mathcal{Y}$ so that

$$ f(x + h) = f(x) + Ah + o(\|h\|_{\mathcal{X}}) $$

as $ h \to 0$. In addition, if we fix some basis of $\mathcal{X}$ and $\mathcal{Y}$, then $A$ can be represented by a matrix.

Example. In calculus and other areas, we often identify a linear map $L : \mathbb{R}^n \to \mathbb{R}$ with a vector $\ell \in \mathbb{R}^n$ via the relation $ Lx = \ell^{\top}x$. If this identification is applied to the derivative $Df$ of a function $f : \mathbb{R}^n \to \mathbb{R}$, then the resulting vector is called the gradient of $f$ and is denoted by $\nabla f$. That is, $ (Df)_x h = (\nabla f(x))^{\top} h $.

2. The above fundamental idea helps us identify the correct form of the "matrix representation" of a derivative. For example, if $x \mapsto A(x)$ is a symmetric-matrix-valued $C^1$-function, then

$$ f(x) = x^{\top}A(x)x $$

defines a $C^1$-function $f : \mathbb{R}^n \to \mathbb{R}$. Hence, its gradient $\nabla f$ at $x$ can be found by identifying the vector $\ell \in \mathbb{R}^n$ satisfying

$$ f(x + h) = f(x) + \ell^{\top}h + o(\|h\|) $$

as $h \to 0$. Indeed, writing $\Delta A = A(x + h) - A(x)$ and noting that $\Delta A = o(1)$ as $h \to 0$, we get

\begin{align*} f(x + h) - f(x) &= (x + h)^{\top} A(x + h) (x + h) - x^{\top} A(x) x \\ &= (x + h)^{\top} (A(x) + \Delta A) (x + h) - x^{\top} A(x) x \\ &= h^{\top} A(x) x + x^{\top} \Delta A \, x + x^{\top} A(x) h + o(h). \end{align*}

So it remains to convert the last expression into the form $\ell^{\top}h + o(h)$ for some vector $\ell \in \mathbb{R}^n$. This can be done by introducing the vector $B(x) = (b_1(x), \ldots, b_n(x))^{\top}$ where

$$ b_k(x) = \sum_{i,j=1}^{n} x_i x_j \frac{\partial A_{ij}(x)}{\partial x_k}. $$

Then $x^{\top} \Delta A \, x = B(x)^{\top} h + o(h)$. With this and the identity $h^{\top} A(x) x = x^{\top} A(x) h$ together,

\begin{align*} f(x + h) - f(x) &= (B(x)^{\top} + 2x^{\top}A(x)) h + o(h) \\ &= (B(x) + 2A(x)x)^{\top} h + o(h). \end{align*}

Therefore it follows that

$$ \nabla f(x) = B(x) + 2A(x)x. $$

I see no way of further simplifying $B(x)$, except by invoking tensor operations. So this is probably the best you can hope for.
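The defining property $f(x + h) = f(x) + \ell^{\top}h + o(\|h\|)$ can itself be tested numerically: with $\ell = B(x) + 2A(x)x$, the remainder should vanish faster than $\|h\|$. A minimal sketch, using the hypothetical choice $A(x) = \begin{pmatrix} 1 + x_1^2 & x_0 \\ x_0 & 1 \end{pmatrix}$ (this matrix, point, and direction are illustrative assumptions):

```python
# Verify nabla f(x) = B(x) + 2 A(x) x via the first-order property:
# the remainder |f(x + t u) - f(x) - t l^T u| should be O(t^2), i.e. o(t).
def A(x):
    return [[1.0 + x[1] ** 2, x[0]], [x[0], 1.0]]

def f(x):
    return sum(x[i] * A(x)[i][j] * x[j] for i in range(2) for j in range(2))

def grad(x):
    # B_k = sum_ij x_i x_j dA_ij/dx_k, worked out by hand for this A:
    # dA/dx0 = [[0,1],[1,0]]  ->  B_0 = 2 x0 x1
    # dA/dx1 = [[2x1,0],[0,0]] -> B_1 = 2 x1 x0^2
    B = [2.0 * x[0] * x[1], 2.0 * x[1] * x[0] ** 2]
    two_Ax = [2.0 * sum(A(x)[i][j] * x[j] for j in range(2)) for i in range(2)]
    return [B[i] + two_Ax[i] for i in range(2)]

x, u = [0.5, 0.2], [1.0, -1.0]
l = grad(x)

def remainder(t):
    # |f(x + t u) - f(x) - l^T (t u)|
    xt = [x[i] + t * u[i] for i in range(2)]
    return abs(f(xt) - f(x) - t * sum(l[i] * u[i] for i in range(2)))

for t in (1e-1, 1e-2, 1e-3):
    print(t, remainder(t) / t)  # the ratio remainder/t shrinks with t
```

The ratio `remainder(t) / t` decreases roughly linearly in `t`, confirming the remainder is $O(t^2)$ and hence that $\ell$ is the gradient.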