Get wrong answer on $\frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}$ when using graph

172 Views Asked by At

I can use the product rule to obtain $\frac{\partial \mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^{\top} \frac{\partial \mathbf{A} \mathbf{x}}{\partial \mathbf{x}}+(\mathbf{A} \mathbf{x})^{\top} \frac{\partial \mathbf{x}}{\partial \mathbf{x}}$, and get the right result $\mathbf{x}^{\top}\left(\mathbf{A}+\mathbf{A}^{\top}\right)$. Here $x$ is a vector, $A$ is a matrix that is not a function of $x$.

However, when I try to calculate by using a graph, the result is different. I am not sure at which step I made the mistake. Please point out where I made it wrong. I used the graph attached (red numbers are the derivatives) and the equations as follows.

the graph

Let $ q_{2}=q_{1}x$, $q_{1}=(A^{T} x)^{T}$, and $k=A^Tx$. I want to find $\frac{d q_{2}}{d x}$

$\frac{d q_{2}}{d x}=\frac{\partial q_{2}}{\partial x}+\frac{\partial q_{2}}{\partial q_{1}} \frac{\partial q_{1}}{\partial x}= q_{1}+x \frac{\partial q_{1}}{\partial x}= q_{1}+x \frac{\partial q_{1}}{\partial k} \frac{\partial k}{\partial x} =x^TA+xIA^T = x^TA+xA^T \ne x^{\top}\left({A}+{A}^{\top}\right)$, $I$ is the NxN identity matrix.

The result I got here is different. I know it is not correct, but I couldn't figure out where I lost it.

Thank you!

1

There are 1 best solutions below

6
On BEST ANSWER

In the line $$\frac{d q_{2}}{d x}=\frac{\partial q_{2}}{\partial x}+\frac{\partial q_{2}}{\partial q_{1}} \frac{\partial q_{1}}{\partial x}= q_{1}+x \frac{\partial q_{1}}{\partial x}=\dots$$ In terms of dimensions of matrices, it looks to me like you have written $$ [1\times n] = [1\times n] + [1\times n][n\times n] = [1\times n] + [n\times 1][n\times n] = \dots $$ because $x$ is a column vector. The product of matrices $[n\times 1][n\times n]$ is not defined, so there is an error here.

Let me try to rewrite in more explicit notation (I identify $\mathbb R^n$ with column vectors $\mathbb R^{n\times 1}$)

\begin{align} q&:\mathbb R^n \to \mathbb R,&q(x) &= x^TAx,\\ q_1&: \mathbb R^n \to \mathbb R^{1\times n}, & q_1(x) &= (A^Tx)^T = x^TA,\\ q_2&:\mathbb R^{1\times n} \times \mathbb R^n\to\mathbb R, & q_2(a,b) &= ab \end{align} Thus $$q(x) = q_2(q_1(x),x)$$ Since there are subtly different spaces involved here, I'll elect to use a notation that works for arbitrary spaces, namely the Frechet derivative $d$ which always satisfies the chain rule $$ d(f\circ g)(x)[h] = df(g(x))[dg(x)[h]],$$ and then the only game is to figure out what spaces each differential belongs to.

Letting $\partial$ denote a partial Frechet derivative, we have $$ dq(x)[h]= \partial_aq_2(q_1(x),x)[dq_1(x)[h]] + \partial_b q_2(q_1(x),x)[h].$$

Now, $q_1$ is linear, and $q_2$ is separately linear in each variable. For any linear map $L$, the Frechet derivative $dL(x)[h]=L(h)$. (The nice thing is that this covers multiplication of matrices on the left and on the right.) Thus for $h\in\mathbb R^n$ and $v\in\mathbb R^{1\times n}$,

\begin{align} dq_1(x)[h]&=q_1(h)=h^TA,\\ \partial_a q_2(a,b)[v]&= vb,\\ \partial_b q_2(a,b)[h]&= ah. \end{align} therefore \begin{align} dq(x)h&=dq_1(x)[h]x + q_1(x)h\\ &=h^T Ax + x^TA h \\ &= x^TA^Th +x^TAh \\ &= x^T(A^T+A)h = ((A^T+A)x)\cdot h \end{align} and finally we can read the Jacobian from the final line: $$ \frac {dq}{dx}(x) = x^T(A^T+A).$$ (And, the gradient vector is its transpose $\nabla q(x) = (A^T+A)x$.)

Equivalently, one has to understand that the Jacobian of a map $\mathbb R^n \to \mathbb R^{1\times n}$ acts on the right: if $g$ is $\mathbb R^{1\times n}$-valued, then $ f(x+h) \approx f(x) + \frac{df}{dx}(x)h $ cannot make sense by dimension counting (remember $h$ is a column vector). So we are forced to accept that the correct approximation is $f(x+h)\approx f(x) + h^T \frac{df}{dx}(x)$. What this means is that the order of terms in chain rule is reversed! Indeed, the standard chain rule formula for the composition of a $\mathbb R^{n\times 1}$-vector valued function $f=f(y)$ and $g=g(x)$ is $$ \frac{d(f\circ g)}{dx}(x) = \frac{df}{dy}(g(x))\frac{dg}{dx}(x)$$ If we take the matrix transpose of this equation, we obtain the chain rule for row-vector-valued functions $h=f^T$: $$\frac{d(h\circ g)}{dx}(x) = \left(\frac{dg}{dx}(x)\right)^T\frac{dh}{dy}(g(x)).$$

A similar issue occurs in the above where the input is the row vector, and the approximation for $q_2$ reads $ q_2(a+v,b)\approx q_2(a,b) + v \frac{\partial q_2}{\partial a}(a,b)$. Hence (and I think deriving from first principles is easiest) \begin{align} q_2(q_1(x+h),b)&\approx q_2(q_1(x),b) + (q_1(x+h)-q_1(x)) \frac{\partial q_2}{\partial a}(q_1(x),b) \\&\approx q_2(q_1(x),b) + h^T\frac{dq_1}{dx}(x) \frac{\partial q_2}{\partial a}(q_1(x),b) \\& = q_2(a,b) + \frac{\partial q_2^T}{\partial a}(q_1(x),b)\frac{dq_1^T}{dx}(x)h \end{align} The correct chain rule is therefore, as painful as it seems, $$\frac{dq}{dx}(x)= \frac{\partial q_2^T}{\partial a}(q_1(x),x)\frac{dq_1^T}{dx}(x) + \frac{\partial q_2}{\partial b}(q_1(x),x)$$

It would be easier IMO to avoid this mess by not using a $\mathbb R^{1\times n}$-valued function.