Do you have any idea on this scalar-by-matrix derivative?


I'm trying to find the derivative

$$\frac { d({ x }^{ T }W{ W }^{ T }x) }{ dW }$$

where $x$ is $n \times 1$ and $W$ is $n \times m$, using

$$\frac { d({ x }^{ T }Wx) }{ dW } = x{ x }^{ T }$$

but I don't know how. Any ideas on how to compute this derivative?


There are 2 best solutions below

Best answer

If $x \in \mathbb{R}^n$ is fixed, let $f$ be the real-valued function such that:

$$ \forall W \in \mathrm{Mat}(n,\mathbb{R}), \; f(W) = x^{\top}WW^{\top}x. $$

The gradient of $f$ with respect to $W$, denoted by $\nabla f(W)$, is the matrix in $\mathrm{Mat}(n,\mathbb{R})$ such that:

$$ \forall H \in \mathrm{Mat}(n,\mathbb{R}), \; f(W+H) = f(W) + \left\langle \nabla f(W), H \right\rangle + o(\Vert H \Vert) $$

where $\left\langle \cdot, \cdot \right\rangle$ (resp. $\Vert \cdot \Vert$) denotes the canonical inner product (resp. the norm induced by the inner product) on $\mathrm{Mat}(n,\mathbb{R})$. That is: $\left\langle M,N \right\rangle = \mathrm{tr}(M^{\top}N)$ for any $M,N \in \mathrm{Mat}(n,\mathbb{R})$.

Here:

$$ \begin{align*} f(W+H) & = {} x^{\top}(W+H)(W+H)^{\top}x \\[2mm] & = x^{\top} \big( W W^{\top} + W H^{\top} + H W^{\top} + H H^{\top} \big) x \\[2mm] & = f(W) + x^{\top} W H^{\top} x + x^{\top} H W^{\top} x + o(\Vert H \Vert). \end{align*} $$

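Here the term $x^{\top} H H^{\top} x$ is $O(\Vert H \Vert^2)$, hence $o(\Vert H \Vert)$. Note that, since the scalar $x^{\top} H W^{\top} x$ equals its transpose $x^{\top} W H^{\top} x$ and the trace is invariant under cyclic permutations: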

$$ \begin{align*} x^{\top} W H^{\top} x + x^{\top} H W^{\top} x & = \mathrm{tr}\big( x^{\top} W H^{\top} x + x^{\top} H W^{\top} x \big) \\[2mm] & = \mathrm{tr}\big( x x^{\top} W H^{\top} + x x^{\top} W H^{\top} \big) \\[2mm] & = \left\langle 2 x x^{\top} W, H \right\rangle. \end{align*} $$

By identification:

$$ \nabla f(W) = 2 x x^{\top} W. $$
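A quick numerical sanity check of this formula can be sketched as follows. This is only an illustration, not part of the derivation: the function names (`f`, `grad_closed_form`, `grad_finite_diff`) and the test values of $x$ and $W$ are made up here, and plain Python lists are used so that no third-party package is assumed.

```python
# Finite-difference check of grad f(W) = 2 x x^T W for f(W) = x^T W W^T x.
# Pure Python; all names and test values below are illustrative assumptions.

def f(x, W):
    """Return x^T W W^T x, i.e. the squared norm of W^T x."""
    n, m = len(W), len(W[0])
    # v = W^T x is a vector of length m
    v = [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]
    return sum(vj * vj for vj in v)

def grad_closed_form(x, W):
    """Return the matrix 2 x x^T W, whose (i, j) entry is 2 * x_i * (W^T x)_j."""
    n, m = len(W), len(W[0])
    v = [sum(x[i] * W[i][j] for i in range(n)) for j in range(m)]
    return [[2.0 * x[i] * v[j] for j in range(m)] for i in range(n)]

def grad_finite_diff(x, W, eps=1e-6):
    """Approximate each partial derivative df/dW_ij by a central difference."""
    n, m = len(W), len(W[0])
    G = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            Wp = [row[:] for row in W]  # copy of W with entry (i, j) nudged up
            Wm = [row[:] for row in W]  # copy of W with entry (i, j) nudged down
            Wp[i][j] += eps
            Wm[i][j] -= eps
            G[i][j] = (f(x, Wp) - f(x, Wm)) / (2 * eps)
    return G

x = [1.0, -2.0, 0.5]
W = [[0.3, -1.0, 2.0],
     [1.5, 0.2, -0.7],
     [-0.4, 0.9, 1.1]]

exact = grad_closed_form(x, W)
approx = grad_finite_diff(x, W)
max_err = max(abs(exact[i][j] - approx[i][j]) for i in range(3) for j in range(3))
print("largest entrywise difference:", max_err)
```

Because $f$ is quadratic in $W$, the central difference has no truncation error, so the two matrices should agree up to floating-point rounding.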


Alternatively, you can obtain the gradient of $f$ using the chain rule. Let $\varphi$ be the map defined on $\mathrm{Mat}(n,\mathbb{R})$ by:

$$ \forall M \in \mathrm{Mat}(n,\mathbb{R}), \; \varphi(M) = x^{\top} M x. $$

Define $\psi$ on $\mathrm{Mat}(n,\mathbb{R})$ by:

$$ \forall W \in \mathrm{Mat}(n,\mathbb{R}), \; \psi(W) = W W^{\top}. $$

It follows that $f = \varphi \circ \psi.$ The chain rule gives:

$$ \mathrm{D}_{W}(\varphi \circ \psi) \cdot H = \mathrm{D}_{\psi(W)}\varphi \cdot \big( \mathrm{D}_{W}\psi \cdot H \big). $$

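But because $\varphi$ is linear, $\mathrm{D}_{M}\varphi = \varphi$ for all $M$. Moreover, $\mathrm{D}_{W}\psi \cdot H = W H^{\top} + H W^{\top}$, which is the part of $\psi(W+H) - \psi(W)$ that is linear in $H$. Therefore: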

$$ \begin{align*} \mathrm{D}_{W}(\varphi \circ \psi) \cdot H & = {} \varphi\big( \mathrm{D}_{W}\psi \cdot H \big) \\[2mm] & = x^{\top} \big( W H^{\top} + H W^{\top} \big) x. \end{align*} $$

Given that $\mathrm{D}_{W}(\varphi \circ \psi) \cdot H = \left\langle \nabla f(W), H \right\rangle$, we have again: $\nabla f(W) = 2 x x^{\top}W.$
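As a cross-check, the same gradient can also be read off entry by entry. The indices $i$, $k$, $l$ are introduced here only for illustration; the computation uses nothing beyond the definition of $f$:

$$ \begin{align*} f(W) & = \Vert W^{\top} x \Vert^2 = \sum_{l} \Big( \sum_{i} x_i W_{il} \Big)^2, \\[2mm] \frac{\partial f}{\partial W_{kl}} & = 2 \, x_k \sum_{i} x_i W_{il} = 2 \, x_k \big( x^{\top} W \big)_{l} = \big( 2 x x^{\top} W \big)_{kl}. \end{align*} $$

This entrywise form does not require $W$ to be square, so it also covers an $n \times m$ matrix $W$.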

Second answer

Using a colon to denote the trace product, i.e.

$$A:B={\rm tr}(A^TB),$$

you can jot down the function, its differential, and the gradient:

$$\eqalign{ \phi &= x^TWW^Tx = x^TW:x^TW \cr d\phi &= 2x^TW:x^TdW = 2xx^TW:dW \cr \frac{\partial\phi}{\partial W} &= 2xx^TW \cr }$$
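The trace-product identities used here can be checked on a small rectangular example ($n = 3$, $m = 2$). The sketch below is only illustrative: the helper names (`frob`, `row_times`, `grad`) and the numbers are assumptions chosen for the example, and plain Python lists are used to avoid any dependency.

```python
# Check of the trace-product identities for a rectangular W (n = 3, m = 2).
# Pure Python; helper names and test values are illustrative assumptions.

def frob(A, B):
    """Trace product A:B = tr(A^T B), i.e. the sum of entrywise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def row_times(x, M):
    """Return the 1 x m row matrix x^T M."""
    return [[sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]]

def grad(x, W):
    """Return the n x m matrix 2 x x^T W."""
    r = row_times(x, W)[0]
    return [[2.0 * xi * rj for rj in r] for xi in x]

x = [1.0, 2.0, -1.0]
W = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
dW = [[0.5, -1.0], [0.0, 2.0], [1.0, 1.0]]  # an arbitrary direction

phi = frob(row_times(x, W), row_times(x, W))         # x^T W : x^T W
lhs = 2.0 * frob(row_times(x, W), row_times(x, dW))  # 2 x^T W : x^T dW
rhs = frob(grad(x, W), dW)                           # 2 x x^T W : dW
print(phi, lhs, rhs)  # prints: 26.0 -9.0 -9.0
```

Here $x^TW = (5, -1)$, so $\phi = 26$, and both forms of the differential evaluate to $-9$ in the chosen direction.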