I'm trying to prove that if $f(x) = x^TBx$, then $f'(x) = (B + B^T)x$. I haven't found this formula online, but going through the calculation in index notation, this is what I came up with; it simplifies to $2Bx$ when $B$ is symmetric. The accepted response to this discussion says that the solution is actually $f'(x) = x^T(B + B^T)$. Going through the proof there, I see how he arrived at it, but I can't see where the mistake is in mine.
The setup
- $x \in \mathbb{R}^n$, it is always a column vector
- $B \in \mathbb{R}^{n \times n}$, $B$ may not be symmetric
My approach
Let $g(x)=x^TB$ and $h(x)=x$, then I can write $f(x)=g(x)h(x)$. Then
- $f(x) \in \mathbb{R}$
- $g(x) \in \mathbb{R}^{1 \times n}$
- $h(x) \in \mathbb{R}^n$
- $f'(x) \in \mathbb{R}^n$
- $g'(x) \in \mathbb{R}^{n \times n}$
- $h'(x) \in \mathbb{R}^{n \times n}$
I've verified myself that $g'(x) = B$ and $h'(x) = I_n$, so I won't repeat those derivations here.
Then, using the product rule I get:
$$f'(x) = g'(x)h(x) + g(x)h'(x)$$
The problem is that the dimensions don't add up. I get $g'(x)h(x) = Bx \in \mathbb{R}^n$, which is good. However, I also have $g(x)h'(x) = x^TBI_n = x^TB \in \mathbb{R}^{1 \times n}$, and as far as I know I can't add two vectors of different sizes.
I know that the solution is going to involve the transpose of the second term; I just can't seem to find where that transpose would come from.
Why do I need to take the transpose of the second term?
[Edit]: Please don't reply with a different proof. What I'm looking for is to understand where the mistake is in my calculation: I clearly made an incorrect step somewhere, and without understanding where, I'm likely to make the same mistake again.
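For anyone reading along, the shape mismatch described above is easy to see numerically. This is just a sketch with an arbitrary non-symmetric $B$ and a random $x$, treating $x$ as an $n \times 1$ column vector as in the setup:

```python
import numpy as np

# Arbitrary non-symmetric B and a column vector x, as in the setup
rng = np.random.default_rng(0)
n = 3
B = rng.standard_normal((n, n))
x = rng.standard_normal((n, 1))

term1 = B @ x    # g'(x) h(x) = Bx,    shape (n, 1)
term2 = x.T @ B  # g(x) h'(x) = x^T B, shape (1, n)
print(term1.shape, term2.shape)  # (3, 1) (1, 3) -- these cannot be added as-is
```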
It may be useful to introduce the Frobenius inner product as:
$$ A:B = \operatorname{tr}(A^TB)$$
with the following properties, derived from the underlying trace function:
$$\eqalign{A:BC &= B^TA:C\cr &= AC^T:B\cr &= A^T:(BC)^T\cr &= BC:A \cr } $$
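These identities can be spot-checked numerically. A minimal sketch with random matrices (the helper `frob` is just the definition $A:B = \operatorname{tr}(A^TB)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

def frob(X, Y):
    """Frobenius inner product X:Y = tr(X^T Y)."""
    return np.trace(X.T @ Y)

lhs = frob(A, B @ C)
assert np.isclose(lhs, frob(B.T @ A, C))      # A:BC = B^T A : C
assert np.isclose(lhs, frob(A @ C.T, B))      # A:BC = A C^T : B
assert np.isclose(lhs, frob(A.T, (B @ C).T))  # A:BC = A^T : (BC)^T
assert np.isclose(lhs, frob(B @ C, A))        # A:BC = BC : A
```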
Then we work with differentials to find the gradient. The product rule works as you expect, and on each side of the colon the dimensions are consistent at every step.
$$\eqalign{ f&= x^TBx\\ &= x : Bx\\ df &= dx : Bx + x : Bdx\\ &=dx : Bx + B^Tx:dx\\ &=(B + B^T)x : dx }$$
Thus the gradient is: $$\frac{\partial f}{\partial x} =(B + B^T)x$$
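As a sanity check, the closed-form gradient agrees with a central finite-difference approximation for a deliberately non-symmetric $B$ (a quick numerical sketch, not part of the proof):

```python
import numpy as np

# Finite-difference check that grad f = (B + B^T) x for f(x) = x^T B x
rng = np.random.default_rng(1)
n = 5
B = rng.standard_normal((n, n))  # deliberately non-symmetric
x = rng.standard_normal(n)

f = lambda v: v @ B @ v
grad_closed = (B + B.T) @ x

eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(grad_closed, grad_fd, atol=1e-5)
```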
edit:
The issue is that, for vector terms, the naive product rule $$\frac{\partial(u^Tv)}{\partial x} \ne \Bigg(\frac{\partial u^T}{\partial x}\Bigg)v + u^T\Bigg(\frac{\partial v}{\partial x}\Bigg)$$ does not hold; the two terms on the right even have incompatible shapes, as you found.
When working with differentials, on the other hand, the product rule does hold:
$$ d(A\star B) = dA\star B + A\star dB $$
where $\star$ can be Frobenius, Kronecker, Hadamard, matrix product, etc.
If you work out the differential in matrix-product form, you will see that a $dx^T$ term appears. Dealing with this term when grouping the $dx$ factors is what causes your missing transpose to appear.
If you want to directly apply a product rule it should read:
$$\frac{\partial(u \cdot v)}{\partial x} = \Bigg(\frac{\partial u}{\partial x}\Bigg)^T v + \Bigg(\frac{\partial v}{\partial x}\Bigg)^T u$$
where $u \cdot v = u^Tv$. With $u=x$ and $v = Bx$, the Jacobians are $\frac{\partial u}{\partial x} = I$ and $\frac{\partial v}{\partial x} = B$, so this again gives $I^T(Bx) + B^Tx = (B + B^T)x$.
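A quick numerical sketch of this corrected product rule, plugging in the Jacobians of $u = x$ and $v = Bx$:

```python
import numpy as np

# Corrected product rule: (du/dx)^T v + (dv/dx)^T u with u = x, v = Bx
rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

Ju = np.eye(n)  # Jacobian of u = x
Jv = B          # Jacobian of v = Bx
grad = Ju.T @ (B @ x) + Jv.T @ x
assert np.allclose(grad, (B + B.T) @ x)  # matches the gradient derived above
```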