Gradient of squared Frobenius norm of a complex matrix

106 Views Asked by At

Can anyone help to solve this problem?

$$ y = \left \| XWZ - X \right \|_{F}^{2} \qquad X \in \mathbb{R}^{a \times b}, \quad W \in \mathbb{R}^{b \times c}, \quad Z \in \mathbb{R}^{c \times b} $$

Gradient of $y$ w.r.t. $W$:

$$ \bigtriangledown_W y = ? $$

Gradient of $y$ w.r.t. $Z$:

$$ \bigtriangledown_Z y = ? $$

3

There are 3 best solutions below

6
On BEST ANSWER

Let's look at $y$ as the following function $$ \varphi: \mathbb{R}^{a \times b} \times \mathbb{R}^{b \times c} \times \mathbb{R}^{c \times b} \longrightarrow \mathbb{R} $$ whose transformation law is
$$ \varphi(X,W,Z) = \left \| XWZ - X \right \|_{F}^{2}= \text{trace}\Big((XWZ - X ) (XWZ - X )^{T} \Big) $$ After some algebraic manipulation you will get $$ \varphi(X, W, Z) = \text{trace}\Big(ZW^T(X^TX)WZ\Big) - 2\text{trace}\Big(ZW^T(X^TX)\Big) + \text{trace}\Big(X^TX\Big) $$

To compute the gradient of $\varphi$ with respect to the matrix $W$, we first differentiate $\varphi(X, W, Z)$ with respect to the entries of $W$. Recall that $\varphi(X, W, Z)$ is given by:

$$ \varphi(X, W, Z) = \text{trace}\Big(ZW^T(X^TX)WZ\Big) - 2\text{trace}\Big(ZW^T(X^TX)\Big) + \text{trace}\Big(X^TX\Big) $$

Let's differentiate each term with respect to $W$.

  1. For the first term, we use the trace properties of products of three and four matrices:

$$ \frac{\partial}{\partial W}\text{trace}\Big(ZW^T(X^TX)WZ\Big) = 2(X^TX)ZWZ^T $$

  1. For the second term, we use the trace property of products of two matrices:

$$ \frac{\partial}{\partial W}\text{trace}\Big(ZW^T(X^TX)\Big) = (X^TX)Z $$

  1. For the third term, since it does not depend on $W$, its derivative is zero:

$$ \frac{\partial}{\partial W}\text{trace}\Big(X^TX\Big) = 0 $$

Now, summing up the derivatives of the three terms, we get the gradient of $\varphi$ with respect to $W$:

$$ \nabla_W \varphi(X, W, Z) = 2(X^TX)ZWZ^T - 2(X^TX)Z $$

Thus, the gradient of $\varphi$ with respect to the matrix $W$ is given by $2(X^TX)ZWZ^T - 2(X^TX)Z$. In this derivation, we used the trace properties of products of two, three, and four matrices, which are crucial for simplifying the expressions and obtaining the gradient.

The calculation of the gradient of $\nabla_{Z}\varphi(X,W,Z)$ is entirely analogous. You can do it yourself.

0
On

Another answer, in terms of the definition of gradient in the direction of a certain component. I am only giving this extra answer because I think it might be more precise and elegant from a mathematical point of view

As in the previous answer let's look at y as the function $$ \varphi: \mathbb{R}^{a \times b} \times \mathbb{R}^{b \times c} \times \mathbb{R}^{c \times b} \longrightarrow \mathbb{R} $$ whose transformation law is
$$ \varphi(X,W,Z) = \left \| XWZ - X \right \|_{F}^{2}= \text{trace}\Big((XWZ - X ) (XWZ - X )^{T} \Big) $$ After some algebraic manipulation you will get $$ \varphi(X, W, Z) = \text{trace}\Big(ZW^T(X^TX)WZ\Big) - 2\text{trace}\Big(ZW^T(X^TX)\Big) + \text{trace}\Big(X^TX\Big) $$

The gradient of a function $\varphi$ with respect to a matrix $W$ can be defined as the unique linear transformation $\nabla_W\varphi(X, W, Z)$ from $\mathbb{R}^{b\times c}$ to $\mathbb{R}$ that satisfies:

$$ \varphi(X, W + H, Z) - \varphi(X, W, Z) = \nabla_W\varphi(X, W, Z) \cdot H + \|H\|_{F}\rho(H) $$

for some function $\rho: \mathbb{R}^{b\times c} \to \mathbb{R}$ such that $\rho(0) = 0$ and

$$ \lim_{H \to 0} \rho(H) = 0 $$

This definition states that the gradient of $\varphi$ at a point $(X, W, Z)$ is the linear transformation that best approximates the change in $\varphi$ as we move in the direction of $H$.

Using this definition, we can compute the gradient of $\varphi$ with respect to the matrix $W$. We already derived the gradient of $\varphi$ with respect to $W$ in a previous answer:

$$ \nabla_W \varphi(X, W, Z) = 2(X^TX)ZWZ^T - 2(X^TX)Z $$

The technique to obtain the gradient is to expand the expression $ \varphi(X, W + H, Z) - \varphi(X, W, Z)$ until you get an expression that is linear with respect to $H$ and quadratic (or any other non-linear expression) with respect to $H$.

\begin{align} \varphi(X, W + H, Z) - \varphi(X, W, Z) &= \left\| X(W+H)Z - X \right\|_{F}^{2} - \left\| XWZ - X \right\|_{F}^{2} \\ &= \text{trace}\Big( (X(W+H)Z - X)(X(W+H)Z - X)^T \Big) \\ &\phantom{=} - \text{trace}\Big( (XWZ - X)(XWZ - X)^T \Big) \\ &= \text{trace}\Big( (XWZ + XHZ - X)(XWZ + XHZ - X)^T \Big) \\ &\phantom{=} - \text{trace}\Big( (XWZ - X)(XWZ - X)^T \Big) \\ &= \text{trace}\Big( (XHZ)(XHZ)^T + 2(XHZ)(XWZ - X)^T \Big) \\ &\phantom{=} + \text{trace}\Big( (XWZ - X)(XWZ - X)^T \Big) \\ &\phantom{=} - \text{trace}\Big( (XWZ - X)(XWZ - X)^T \Big) \\ &= \underbrace{2\text{trace}\Big( (XHZ)(XWZ - X)^T \Big)}_{\nabla_W\varphi(X, W, Z) \cdot H} \\ &+\|H\|_{F}\underbrace{\text{trace}\left( (X\dfrac{H}{\|H\|_{F}}Z)(XHZ)^T \right)}_{\|H\|_{F}\rho(H)} \end{align}

Now use the cyclical invariance property of the trace states that the trace of the product of $k$ matrices remains unchanged when the matrices are permuted according to a cyclical permutation. Let $A_1, A_2, \ldots, A_k$ be a sequence of matrices that are compatible for multiplication, meaning the number of columns of matrix $A_i$ is equal to the number of rows of matrix $A_{i+1}$ for $i = 1, 2, \ldots, k-1$, and the number of columns of matrix $A_k$ is equal to the number of rows of matrix $A_1$. Then, the property can be stated as:

$$ \text{trace}(A_1 A_2 \ldots A_k) = \text{trace}(A_2 A_3 \ldots A_k A_1) = \text{trace}(A_3 A_4 \ldots A_k A_1 A_2) = \cdots = \text{trace}(A_k A_1 A_2 \ldots A_{k-1}). $$

The calculation of the gradient of $\nabla_{Z}\varphi(X,W,Z)$ is entirely analogous. You can do it yourself.

0
On

$$\eqalign{ \def\b{{y}} \def\p{\partial} \def\qiq{\quad\implies\quad} \def\qqq{\qquad\qquad\qquad} A:B &= {\rm Tr}(A^TB) \qqq \{ {\rm Frobenius\;product} \} \\ A &= XWZ-X \\ dA &= X\,(dW\,Z + W\,dZ) \qquad \{ {\rm differential} \} \\ \\ \b &= \|A\|^2_F = A:A \\ d\b &= 2A:dA \\ &= 2A:X\,(dW\,Z + W\,dZ) \\ &= 2X^TAZ^T:dW \;+\; 2W^TX^TA:dZ \\ \frac{\p\b}{\p W} &= 2X^TAZ^T \qqq \{ {\rm gradient\;wrt\;}W \} \\ \frac{\p\b}{\p Z} &= 2W^TX^TA \\ }$$