How to calculate the gradient of $\mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}$ w.r.t. $\mathbf{W}$?


I need to calculate the gradient of $\mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}$ w.r.t. $\mathbf{W}$. Here is what I have tried. Let $\mathbf{A}=\mathbf{W}^2$; then the form reduces to \begin{align*} \frac{\partial \mathbf{x}^T\mathbf{AA}^T\mathbf{x}}{\partial \mathbf{A}} =&\frac{\partial \mathbf{x}^T\mathbf{B}^T\mathbf{B}\mathbf{x}}{\partial \mathbf{B}^T} \quad\quad (\text{where }\mathbf{A}=\mathbf{B}^T) \\ =&\left(\frac{\partial \mathbf{x}^T\mathbf{B}^T\mathbf{B}\mathbf{x}}{\partial \mathbf{B}}\right)^T \\ =&\left(\mathbf{B}(\mathbf{x}\mathbf{x}^T+\mathbf{x}\mathbf{x}^T)\right)^T\\ =&2\mathbf{x}\mathbf{x}^T\mathbf{B}^T\\ =&2\mathbf{x}\mathbf{x}^T\mathbf{A} \end{align*} which follows from formula (77) in the Matrix Cookbook, specifically, $$ \frac{\partial \mathbf{b}^T\mathbf{X}^T\mathbf{X}\mathbf{c}}{\partial\mathbf{X}}=\mathbf{X}(\mathbf{b}\mathbf{c}^T+\mathbf{c}\mathbf{b}^T). $$

I was then trying to use the chain rule, since we already know $$ \frac{\partial \mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}}{\partial \mathbf{W}^2}=2\mathbf{x}\mathbf{x}^T\mathbf{W}^2. $$ The next step is supposed to be $$ \frac{\partial \mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}}{\partial \mathbf{W}}=\frac{\partial\mathbf{W}^2}{\partial\mathbf{W}}\frac{\partial \mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}}{\partial \mathbf{W}^2} $$ where we use the denominator layout. However, the dimensions of $\frac{\partial\mathbf{W}^2}{\partial\mathbf{W}}$ do not match the dimensions of $\frac{\partial \mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}}{\partial \mathbf{W}^2}$, namely $2\mathbf{x}\mathbf{x}^T\mathbf{W}^2$. Could anyone give me any clues? I would appreciate it.
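The intermediate result $\partial(\mathbf{x}^T\mathbf{A}\mathbf{A}^T\mathbf{x})/\partial\mathbf{A}=2\mathbf{x}\mathbf{x}^T\mathbf{A}$ can be sanity-checked numerically. Below is a minimal sketch using NumPy and central finite differences; the sizes and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
x = rng.standard_normal((n, 1))
A = rng.standard_normal((n, n))

def f(A):
    # scalar x^T A A^T x
    return (x.T @ A @ A.T @ x).item()

# closed-form gradient from the derivation above: 2 x x^T A
grad_closed = 2 * x @ x.T @ A

# central finite differences, one entry of A at a time
eps = 1e-6
grad_fd = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad_fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-5))
```

This only confirms the gradient with respect to $\mathbf{A}$; the remaining difficulty is passing from $\mathbf{A}=\mathbf{W}^2$ to $\mathbf{W}$.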

Update: I tried to use the Frobenius inner product as follows. Let $z=\mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}$ and $\mathbf{A}=\mathbf{W}^2$; then we have \begin{align*} \mathrm{d}z&=2\mathbf{x}\mathbf{x}^T\mathbf{A}:\mathrm{d}\mathbf{A} \\ &=2\mathbf{x}\mathbf{x}^T\mathbf{A}:\mathrm{d}\mathbf{W}^2\\ &=2\mathbf{x}\mathbf{x}^T\mathbf{A}:(\mathrm{d}\mathbf{W}\mathbf{W}+\mathbf{W}\mathrm{d}\mathbf{W})\\ &=2\mathbf{x}\mathbf{x}^T\mathbf{A}:\mathrm{d}\mathbf{W}\mathbf{W}+2\mathbf{x}\mathbf{x}^T\mathbf{A}:\mathbf{W}\mathrm{d}\mathbf{W}\\ &=2\mathbf{x}\mathbf{x}^T\mathbf{A}\mathbf{W}^T:\mathrm{d}\mathbf{W}+2\mathbf{W}^T\mathbf{x}\mathbf{x}^T\mathbf{A}:\mathrm{d}\mathbf{W}\\ &=2(\mathbf{x}\mathbf{x}^T\mathbf{A}\mathbf{W}^T+\mathbf{W}^T\mathbf{x}\mathbf{x}^T\mathbf{A}):\mathrm{d}\mathbf{W} \end{align*} which gives the solution $$ \frac{\partial \mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}}{\partial \mathbf{W}}=2(\mathbf{x}\mathbf{x}^T\mathbf{A}\mathbf{W}^T+\mathbf{W}^T\mathbf{x}\mathbf{x}^T\mathbf{A}), \quad \mathbf{A}=\mathbf{W}^2. $$ I tested this result for a $2\times 2$ $\mathbf{W}$ against PyTorch's automatic differentiation, which returns an identical result. This suggests the above derivation is correct.
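The PyTorch check described above can be sketched as follows (a minimal reproduction, not the asker's exact script; sizes and seed are arbitrary):

```python
import torch

torch.manual_seed(0)
n = 2
x = torch.randn(n, 1, dtype=torch.float64)
W = torch.randn(n, n, dtype=torch.float64, requires_grad=True)

A = W @ W                      # A = W^2 (matrix square)
z = x.T @ A @ A.T @ x          # scalar objective
z.backward()                   # autograd gradient lands in W.grad

# closed-form result from the Frobenius-product derivation
with torch.no_grad():
    grad_closed = 2 * (x @ x.T @ A @ W.T + W.T @ x @ x.T @ A)

print(torch.allclose(W.grad, grad_closed))
```

In exact arithmetic the two gradients agree; in float64 they match to machine precision.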

Best Answer

Let $\phi=\mathbf{x}^T\mathbf{W}^2(\mathbf{W}^2)^T\mathbf{x}$ and $\mathbf{y}=(\mathbf{W}^2)^T \mathbf{x}$, so that $\phi=\mathbf{y}^T\mathbf{y}$. Then \begin{eqnarray*} d\phi &=& 2 \mathbf{y}:d\mathbf{y} \\ &=& 2 \mathbf{y}\mathbf{x}^T:d(\mathbf{W}^2)^T \\ &=& 2 \mathbf{B}:d(\mathbf{W}^2) \\ &=& 2 \mathbf{B}:(d\mathbf{W}\,\mathbf{W}+\mathbf{W}\,d\mathbf{W}) \\ &=& 2 \left[\mathbf{W}^T\mathbf{B}+ \mathbf{B} \mathbf{W}^T \right] :d\mathbf{W} \end{eqnarray*} where $\mathbf{B} =\mathbf{x}\mathbf{y}^T =\mathbf{x}\mathbf{x}^T\mathbf{W}^2$, hence $$ \frac{\partial \phi}{\partial \mathbf{W}}=2\left(\mathbf{W}^T\mathbf{B}+\mathbf{B}\mathbf{W}^T\right). $$ Note that $2\mathbf{B}=2\mathbf{x}\mathbf{x}^T\mathbf{W}^2$ is exactly the gradient with respect to $\mathbf{W}^2$ that you found by yourself.
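The final form $2(\mathbf{W}^T\mathbf{B}+\mathbf{B}\mathbf{W}^T)$ with $\mathbf{B}=\mathbf{x}\mathbf{x}^T\mathbf{W}^2$ can likewise be checked against a finite-difference gradient of $\phi(\mathbf{W})$ directly; a sketch in NumPy, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
x = rng.standard_normal((n, 1))
W = rng.standard_normal((n, n))

def phi(W):
    # phi(W) = x^T W^2 (W^2)^T x
    W2 = W @ W
    return (x.T @ W2 @ W2.T @ x).item()

# answer's closed form: 2 (W^T B + B W^T), B = x x^T W^2
B = x @ x.T @ (W @ W)
grad_closed = 2 * (W.T @ B + B @ W.T)

# central finite differences, one entry of W at a time
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_fd[i, j] = (phi(W + E) - phi(W - E)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-4))
```

Note that $\mathbf{W}^T\mathbf{B}+\mathbf{B}\mathbf{W}^T$ expands to $\mathbf{W}^T\mathbf{x}\mathbf{x}^T\mathbf{A}+\mathbf{x}\mathbf{x}^T\mathbf{A}\mathbf{W}^T$ with $\mathbf{A}=\mathbf{W}^2$, i.e. it is the same expression obtained in the update above.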