Matrix derivative $\frac{∂(Σm_{ik}m_{jk})}{∂m_{xy}}=2m_{xy}$?

My question is about the matrix derivative in a paper called Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. I am interested in the proof of $∂P/∂M=2M$. I have found several posts on this site, and I think $∂P/∂M$ should be a 4-D tensor, as those posts say. Why does this paper say it is $2M$ (2-D)?

$M$ is a parameterized matrix, and we want it to be close to a semi-orthogonal matrix. $$P≡MM^T, Q≡P-I, f=tr(QQ^T)$$ Of the three derivatives below, I understand the first two, but I do not see why the last one holds. Can anyone give a proof? $$∂f/∂Q=2Q$$ $$∂Q/∂P=I$$ $$∂P/∂M=2M$$

Thanks in advance.

Edit

Why can the following be written as $2M$? I think only the diagonal elements can give $2m_{ij}$. \begin{align} ∂P/∂M &= \frac{∂(\sum_k m_{ik}m_{jk})}{∂m_{xy}} \\ &= \sum_k \frac{∂(m_{ik}m_{jk})}{∂m_{xy}} \\ &= \sum_k \left(\frac{∂m_{ik}}{∂m_{xy}}m_{jk}+m_{ik}\frac{∂m_{jk}}{∂m_{xy}}\right) \end{align}

Best answer

Consider the mapping $\mathbf{P}:\mathcal{M}_{n\times m}(\mathbb{R})\rightarrow \mathcal{M}_{n\times n}(\mathbb{R})$ given by $\mathbf{P}(\mathbf{M}) = \mathbf{M}\mathbf{M}^T$ and, for a fixed vector $\mathbf{x}\in\mathbb{R}^n$, let $F:\mathcal{M}_{n\times n}(\mathbb{R})\rightarrow \mathbb{R}$ be given by $F(\mathbf{P}):=\mathbf{x}^T\mathbf{P}\mathbf{x}$. Define $g:\mathcal{M}_{n\times m}(\mathbb{R})\rightarrow \mathbb{R}$ by $g(\mathbf{M})= (F\circ \mathbf{P})(\mathbf{M}).$

Observe the directional derivative of $g$ is given by \begin{align} D_\mathbf{M} g(\mathbf{A})=&\ \frac{d}{d\epsilon} g(\mathbf{M}+\epsilon \mathbf{A})\big|_{\epsilon =0} = \frac{d}{d\epsilon} \mathbf{x}^T\left( \mathbf{M}+\epsilon \mathbf{A}\right)\left( \mathbf{M}+\epsilon \mathbf{A}\right)^T\mathbf{x}\big|_{\epsilon =0}\\ =&\ \frac{d}{d\epsilon} \mathbf{x}^T\left( \mathbf{M}\mathbf{M}^T+\epsilon \mathbf{A}\mathbf{M}^T+\epsilon\mathbf{M}\mathbf{A}^T+ \epsilon^2 \mathbf{A}\mathbf{A}^T\right)\mathbf{x}\big|_{\epsilon =0}\\ =&\ \mathbf{x}^T\mathbf{M}\mathbf{A}^T\mathbf{x}+\mathbf{x}^T\mathbf{A}\mathbf{M}^T\mathbf{x} = 2\mathbf{x}^T\mathbf{A}\mathbf{M}^T\mathbf{x} \end{align} and \begin{align} & D_\mathbf{M} g(\mathbf{A}) = (D_\mathbf{P}F\circ D_\mathbf{M}\mathbf{P})( \mathbf{A}) = \mathbf{x}^T\left(D_\mathbf{M}\mathbf{P}(\mathbf{A}) \right)\mathbf{x} \\ & \ \ \implies \ \ D_\mathbf{M}\mathbf{P} = 2\mathbf{M}^T\implies\ \frac{\partial \mathbf{P}}{\partial \mathbf{M}} = 2\mathbf{M}. \end{align}

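These equalities hold only in a weak sense: $D_\mathbf{M}\mathbf{P}$ is a linear map acting on the direction $\mathbf{A}$, and it is identified with $2\mathbf{M}^T$ only after testing against the quadratic form $\mathbf{x}^T(\cdot)\,\mathbf{x}$.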
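To make the weak sense concrete, here is a sketch of the exact derivative in the question's index notation. The Kronecker delta $\delta$ and the entrywise notation $P_{ij}$ are introduced here for illustration and do not appear in the paper's shorthand. \begin{align} \frac{\partial P_{ij}}{\partial m_{xy}} &= \sum_k\left(\delta_{ix}\delta_{ky}\, m_{jk} + m_{ik}\,\delta_{jx}\delta_{ky}\right) = \delta_{ix}\, m_{jy} + \delta_{jx}\, m_{iy}, \\ D_\mathbf{M}\mathbf{P}(\mathbf{A}) &= \sum_{x,y}\frac{\partial \mathbf{P}}{\partial m_{xy}}\, A_{xy} = \mathbf{A}\mathbf{M}^T + \mathbf{M}\mathbf{A}^T. \end{align}

So the full derivative is indeed a 4-index object, and only the entries with $i=j=x$ equal $2m_{xy}$. The shorthand $2\mathbf{M}$ becomes harmless once it is contracted with a symmetric matrix: with $\mathbf{Q}=\mathbf{P}-I$ symmetric, $\langle 2\mathbf{Q}, \mathbf{A}\mathbf{M}^T + \mathbf{M}\mathbf{A}^T\rangle = \langle 4\mathbf{Q}\mathbf{M}, \mathbf{A}\rangle$, which suggests $\partial f/\partial \mathbf{M} = 4\mathbf{Q}\mathbf{M}$, the same result the formal product $2\mathbf{Q}\cdot I\cdot 2\mathbf{M}$ gives.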

Additional: Consider $f:\mathcal{M}_{n\times n}(\mathbb{R})\rightarrow \mathbb{R}$ given by $f(\mathbf{P})=\text{tr}((\mathbf{P}-I)(\mathbf{P}-I)^T)$. Note that $\mathcal{M}_{n\times n}(\mathbb{R})$ is an inner product space with $\langle\mathbf{A}, \mathbf{B}\rangle :=\text{tr}(\mathbf{A}\mathbf{B}^T)$. Then we can write $f(\mathbf{P}) = \langle\mathbf{P}-I, \mathbf{P}-I\rangle$. Finally, with $\mathbf{Q}=\mathbf{P}-I$, we see that \begin{align} \left\langle \frac{\partial f}{\partial \mathbf{P}}, \mathbf{A}\right\rangle =&\ D_\mathbf{P}f(\mathbf{A}) = \frac{d}{d\epsilon} f(\mathbf{P}+\epsilon \mathbf{A})\Big|_{\epsilon = 0} = \frac{d}{d\epsilon} \langle(\mathbf{P}+\epsilon \mathbf{A}-I), (\mathbf{P}+\epsilon \mathbf{A}-I)\rangle\Big|_{\epsilon = 0}\\ =&\ 2\langle \mathbf{P}-I, \mathbf{A}\rangle = \langle 2\mathbf{Q}, \mathbf{A}\rangle. \end{align} Hence $\partial f/\partial \mathbf{P} = 2\mathbf{Q}$.
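These identities can also be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the function names, the matrix sizes, the random seed, and the step size `eps` are all illustrative assumptions. It compares central finite differences against two closed forms: the directional derivative $2\mathbf{x}^T\mathbf{A}\mathbf{M}^T\mathbf{x}$ of $g$, and the gradient $4\mathbf{Q}\mathbf{M}$ of $f$ with respect to $\mathbf{M}$ that results from carrying the chain rule through (using that $\mathbf{Q}$ is symmetric).

```python
import numpy as np

def f_of_M(M):
    """f(M) = tr(Q Q^T) with Q = M M^T - I."""
    Q = M @ M.T - np.eye(M.shape[0])
    return np.trace(Q @ Q.T)

def grad_f_closed_form(M):
    """Candidate gradient 4 Q M (assumes the chain-rule result; Q is symmetric)."""
    Q = M @ M.T - np.eye(M.shape[0])
    return 4.0 * Q @ M

def grad_f_finite_diff(M, eps=1e-6):
    """Entry-by-entry central finite differences of f with respect to M."""
    G = np.zeros_like(M)
    for row in range(M.shape[0]):
        for col in range(M.shape[1]):
            E = np.zeros_like(M)
            E[row, col] = eps
            G[row, col] = (f_of_M(M + E) - f_of_M(M - E)) / (2.0 * eps)
    return G

# Illustrative sizes: n = 3, m = 5 (hypothetical, chosen only for the check).
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))
A = rng.standard_normal((3, 5))   # perturbation direction
x = rng.standard_normal(3)        # fixed test vector from the answer

def g(M_arg):
    """g(M) = x^T M M^T x, the scalar function used in the answer."""
    return x @ M_arg @ M_arg.T @ x

eps = 1e-6
dg_numeric = (g(M + eps * A) - g(M - eps * A)) / (2.0 * eps)
dg_closed = 2.0 * (x @ A @ M.T @ x)

max_grad_error = np.max(np.abs(grad_f_closed_form(M) - grad_f_finite_diff(M)))

print("directional derivative (numeric, closed form):", dg_numeric, dg_closed)
print("max |4QM - finite difference|:", max_grad_error)
```

Both discrepancies should be at the level of finite-difference noise. Replacing `4.0 * Q @ M` with an entrywise reading of the shorthand would not pass the same check, which is one way to see that $2\mathbf{M}$ only makes sense inside the chain-rule product.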