Proof of derivative of $x^TBx$ using the product rule


I'm trying to prove that when $f(x) = x^TBx$, then $f'(x) = (B + B^T)x$. I haven't found this formula online, but working through the calculation in index notation, this is what I came up with; it simplifies to $2Bx$ when $B$ is symmetric. The accepted response to this discussion says that the solution is actually $f'(x) = x^T(B + B^T)$. Going through the proof there, I can see how he got that result, but I can't see where the mistake is in mine.

The setup

  • $x \in \mathbb{R}^n$; it is always a column vector
  • $B \in \mathbb{R}^{n \times n}$; $B$ is not necessarily symmetric

My approach

Let $g(x)=x^TB$ and $h(x)=x$, then I can write $f(x)=g(x)h(x)$. Then

  • $f(x) \in \mathbb{R}$
  • $g(x) \in \mathbb{R}^{1 \times n}$
  • $h(x) \in \mathbb{R}^n$
  • $f'(x) \in \mathbb{R}^n$
  • $g'(x) \in \mathbb{R}^{n \times n}$
  • $h'(x) \in \mathbb{R}^{n \times n}$

I've worked through for myself why $g'(x) = B$ and $h'(x) = I_n$, so I won't repeat those derivations here.

Then, using the product rule I get:

$$f'(x) = g'(x)h(x) + g(x)h'(x)$$

The problem is that the dimensions don't add up. I get $g'(x)h(x) = Bx \in \mathbb{R}^n$, which is fine. However, I also have $g(x)h'(x) = x^TBI_n = x^TB \in \mathbb{R}^{1 \times n}$, and as far as I know I can't add two vectors of different shapes.

I know that the solution is going to be the transpose of the second term; I just can't see where that transpose would come from.

Why do I need to take the transpose of the second term?

[Edit]: Please don't reply with a different proof. What I'm looking for is to understand where the mistake is in my calculation: I clearly made an incorrect step somewhere, and without understanding where, I'm likely to make the same mistake again.


Accepted answer

It may be useful to introduce the Frobenius inner product as:

$$ A:B = \operatorname{tr}(A^TB)$$

with the following properties, derived from the underlying trace function:

$$\eqalign{A:BC &= B^TA:C\cr &= AC^T:B\cr &= A^T:(BC)^T\cr &= BC:A \cr } $$
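For example, the first of these identities (spelled out here for clarity; the others follow similarly from $\operatorname{tr}(MN)=\operatorname{tr}(NM)$ and $\operatorname{tr}(M)=\operatorname{tr}(M^T)$) is just a transpose moved inside the trace:

$$\eqalign{A:BC &= \operatorname{tr}(A^TBC) \cr &= \operatorname{tr}\big((B^TA)^TC\big) \cr &= B^TA:C \cr}$$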

Then we work with differentials to find the gradient. The product rule works as you would expect, and on each side of the colon the dimensions are consistent.

$$\eqalign{ f&= x^TBx\\ &= x : Bx\\ df &= dx : Bx + x : Bdx\\ &=dx : Bx + B^Tx:dx\\ &=(B + B^T)x : dx }$$

Thus the gradient is: $$\frac{\partial f}{\partial x} =(B + B^T)x$$
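As a quick numerical sanity check (a sketch added here, not part of the original answer; it uses NumPy with a randomly generated, deliberately non-symmetric $B$), the closed form matches a finite-difference approximation of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))    # deliberately non-symmetric
x = rng.standard_normal(n)

def f(x):
    return x @ B @ x               # f(x) = x^T B x (a scalar)

# Central finite differences along each coordinate direction.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

grad_closed = (B + B.T) @ x        # the claimed gradient

assert np.allclose(grad_fd, grad_closed, atol=1e-6)
```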

Edit:

The issue is that for vector terms: $$\frac{\partial(u^Tv)}{\partial x} \ne \Bigg(\frac{\partial u^T}{\partial x}\Bigg)v + u^T\Bigg(\frac{\partial v}{\partial x}\Bigg)$$

When working with differentials, on the other hand, the product rule does hold:

$$ d(A\star B) = dA\star B + A\star dB $$

where $\star$ can be Frobenius, Kronecker, Hadamard, matrix product, etc.

If you work out the differential in the matrix product form, you will see that a term $dx^T$ appears. Rearranging that term so that everything is grouped against $dx$ is where your missing transpose comes from.
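Spelled out (an intermediate derivation added here for clarity), the matrix product form gives:

$$\eqalign{df &= dx^T\,Bx + x^TB\,dx \cr &= (dx^T\,Bx)^T + x^TB\,dx \cr &= x^TB^T\,dx + x^TB\,dx \cr &= x^T(B+B^T)\,dx \cr}$$

where the second line uses the fact that a scalar equals its own transpose. Reading off the gradient in the column convention gives $(B+B^T)x$.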

If you want to apply a product rule directly, it should read:

$$\frac{\partial(u \cdot v)}{\partial x} = \Bigg(\frac{\partial u}{\partial x}\Bigg)^T v + \Bigg(\frac{\partial v}{\partial x}\Bigg)^T u$$

where $u \cdot v = u^Tv$, with $u=x$ and $v = Bx$.
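Applying this rule with $u = x$ and $v = Bx$, whose Jacobians are $\partial u/\partial x = I$ and $\partial v/\partial x = B$, recovers the result:

$$\frac{\partial(x^TBx)}{\partial x} = I^T(Bx) + B^Tx = (B+B^T)x$$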

Second answer

The easiest way is really to write the function as a scalar:

$$f(x)=x^TBx=\sum_{i=1}^n\sum_{j=1}^n x_ib_{ij}x_j.$$ Then $$\partial _kf(x)=\sum_{i=1}^n x_ib_{ik}+\sum_{i=1}^n b_{ki}x_i=\sum_{i=1}^n(b_{ik}+b_{ki})x_i.$$ Therefore $f'(x)=(B+B^T)x$ as a column vector, or equivalently $f'(x)=x^T(B+B^T)$ as a row vector.
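The same componentwise computation can be checked symbolically (a sketch using SymPy; the symbol names are invented for illustration):

```python
import sympy as sp

n = 3
x = sp.symbols(f'x0:{n}')                                 # x_0, ..., x_{n-1}
B = sp.Matrix(n, n, lambda i, j: sp.Symbol(f'b{i}{j}'))   # generic, non-symmetric

# f(x) = sum_i sum_j x_i b_ij x_j
f = sum(x[i] * B[i, j] * x[j] for i in range(n) for j in range(n))

# Differentiate component by component, as in the answer above.
grad = [sp.diff(f, x[k]) for k in range(n)]

# Claimed closed form: (B + B^T) x, read off componentwise.
xvec = sp.Matrix(x)
claimed = (B + B.T) * xvec

assert all(sp.expand(grad[k] - claimed[k]) == 0 for k in range(n))
```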

Third answer

The way you try to define derivatives with respect to $x$ has a subtle inconsistency. On the one hand, you insist the derivative of $x^TB$ is $B$, implying that differentiation cancels an $x^T$ from the left. On the other hand, you insist the derivative of $x$ (i.e. $Ix$, not $x^TI=x^T$) is $I$, i.e. differentiation cancels an $x$ from the right. These two incompatible conventions are why the two terms of your product rule come out with mismatched shapes. It's better to work through explicit indices.

Whether the derivative of the scalar $x^TBx$ is the column vector $(B+B^T)x$ or the transpose thereof, the row vector $x^T(B+B^T)$, is a matter of convention. (However, a column-vector convention has one obvious advantage: the chain rule becomes $df=(dx)^T\nabla f$.) What is not a matter of convention is its $i$th component is $$\partial_i (x_jB_{jk}x_k)=\delta_{ij}B_{jk}x_k+x_jB_{jk}\delta_{ik}=B_{ik}x_k+B_{ji}x_j=[(B+B^T)x]_i.$$