Derivative of $\frac{\partial f(X^{\top}X)}{\partial X}$


The task is to prove that for any matrix $X$ and differentiable scalar function $f$, the following holds:

$$ \frac{\partial f(X^\top X)}{\partial X} = 2{X}\frac{\partial f(X^\top X)}{\partial (X^\top X)}. $$

Following the Chain Rule, I could already denote that:

$$ \frac{\partial f(X^\top X)}{\partial X} = \frac{\partial f(u)}{\partial X} = \frac{\partial f(u)}{\partial u}\frac{\partial u}{\partial X} = \frac{\partial f(X^\top X)}{\partial (X^\top X)}\frac{\partial X^\top X}{\partial X} $$

where $u$ is the function of $X$ given by $u(X) = X^{\top}X$.

Now, I have difficulty proving that $\dfrac{\partial (X^{\top}X)}{\partial X} = 2X$.

I know from this previous question, as well as The Matrix Cookbook, that for a vector $\mathbf{x}$ the squared Euclidean norm satisfies $\dfrac{\partial \|\mathbf{x}\|^2_2}{\partial \mathbf{x}} = \dfrac{\partial (\mathbf{x}^{\top}\mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}$, but how can I go from this to rigorously arguing the same for $\dfrac{\partial f(X^{\top} X)}{\partial X}$? Or is there an even simpler way to do it?

Any hints would be appreciated! Thank you very much.
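(As a quick numerical sanity check of the claimed identity — a sketch, not part of the proof — one can pick an arbitrary test function, here $f(Y)=\sum_{i,j}\sin y_{i,j}$ so that $\partial f/\partial Y=\cos Y$ entrywise, and compare $2X\,\partial f/\partial Y$ against finite differences:)

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))

# Assumed test function of Y = X^T X: f(Y) = sum of sin applied entrywise,
# so its entrywise gradient is df/dY = cos(Y).
f = lambda Y: np.sin(Y).sum()
dfdY = lambda Y: np.cos(Y)

# Claimed identity: d f(X^T X)/dX = 2 X * (df/dY evaluated at Y = X^T X)
Y = X.T @ X
claimed = 2 * X @ dfdY(Y)

# Central-difference gradient w.r.t. each entry of X
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f((X + E).T @ (X + E)) - f((X - E).T @ (X - E))) / (2 * eps)

assert np.allclose(claimed, numeric, atol=1e-5)
```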


There are 3 answers below.


I'm not sure what the notation $\frac{\partial \alpha (X)}{\partial X}$ means, but you can write $f(X^\top X)=(f\circ g\circ h)(X)$ for $h(X):=(X^\top ,X)$ and $g(X,Y):=XY$. Then, using the Fréchet derivative, the chain rule gives for every matrix $H$ $$ \begin{align*} \partial [f(X^\top X)]H&=\partial f\big((g\circ h)(X)\big)\,\partial g\big(h(X)\big)\,\partial h(X)H\\ &=\partial f(X^\top X)\,\partial g(X^\top ,X)\,\partial h(X)H\\ &=\partial f(X^\top X)\,\partial g(X^\top ,X)(H^\top ,H)\\ &=\partial f(X^\top X)(X^\top H+H^\top X), \end{align*} $$ where the last step uses that $g$ is bilinear, so $\partial g(A,B)(H_1,H_2)=H_1B+AH_2$.

Hope it helps.
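The directional-derivative formula above can be checked numerically. In this sketch, $f(Y)=\sum_{i,j}\sin y_{i,j}$ is an assumed test function whose Fréchet derivative at $Y$ acts on a direction $K$ as the Frobenius inner product $\langle\cos Y,\,K\rangle$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
X = rng.standard_normal((m, n))
H = rng.standard_normal((m, n))

# Assumed test function: f(Y) = sum(sin(Y)); its Frechet derivative
# at Y acts on a direction K as <cos(Y), K> (Frobenius inner product).
f = lambda Y: np.sin(Y).sum()
Df = lambda Y, K: np.sum(np.cos(Y) * K)

Y = X.T @ X
# Chain-rule result from the answer: D[f(X^T X)](H) = Df(Y)[X^T H + H^T X]
chain = Df(Y, X.T @ H + H.T @ X)

# Compare with a central difference of f(X^T X) along the direction H
t = 1e-6
numeric = (f((X + t*H).T @ (X + t*H)) - f((X - t*H).T @ (X - t*H))) / (2*t)

assert abs(chain - numeric) < 1e-5
```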


Before proving this claim, it is important to clarify the notation. Given a function $f:M_{m\times n}(\Bbb C)\to \Bbb C$, we define its derivative $\frac{\partial f(X)}{\partial X}$ as the $m\times n$ matrix $D_f=[d_{i,j}]_{m\times n}$ with entries $$ d_{i,j}=\frac{\partial f(X)}{\partial x_{i,j}}, $$ where the differentiation is performed w.r.t. $X=[x_{i,j}]_{m\times n}$. Now, defining $Y=X^TX=[y_{i,j}]_{n\times n}$, we have $y_{i,j}=\sum_{k=1}^mx_{k,i}x_{k,j}$; in particular $Y$ is symmetric, so $\frac{\partial f}{\partial y_{\mu,\nu}}=\frac{\partial f}{\partial y_{\nu,\mu}}$. By the chain rule, $$ \begin{aligned} \frac{\partial f(X^TX)}{\partial x_{i,j}} &= \sum_{\mu=1}^n\sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{\mu,\nu}}\frac{\partial y_{\mu,\nu}}{\partial x_{i,j}} = \sum_{\mu=1}^n\sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{\mu,\nu}}\,\frac{\partial}{\partial x_{i,j}}\sum_{\kappa=1}^mx_{\kappa,\mu}x_{\kappa,\nu} \\ &= \sum_{\substack{\mu=1\\ \mu\ne j}}^n\sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{\mu,\nu}}\cdot 0 \;+\; \sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}\,x_{i,\nu} \;+\; \sum_{\substack{\mu=1\\ \mu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{\mu,j}}\,x_{i,\mu} \;+\; \frac{\partial f(X^TX)}{\partial y_{j,j}}\,2x_{i,j} \\ &= \sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}\,2x_{i,\nu} \;+\; \frac{\partial f(X^TX)}{\partial y_{j,j}}\,2x_{i,j} \\ &= \sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}\,2x_{i,\nu}, \end{aligned} $$ where the third equality merges the two single sums using the symmetry $\frac{\partial f}{\partial y_{j,\nu}}=\frac{\partial f}{\partial y_{\nu,j}}$. From here, $\frac{\partial f(X^TX)}{\partial y_{i,j}}$ are the entries of $\frac{\partial f(X^TX)}{\partial (X^TX)}$.
Therefore, by the definition of matrix multiplication (and again using the symmetry $\frac{\partial f}{\partial y_{j,\nu}}=\frac{\partial f}{\partial y_{\nu,j}}$), the numbers $\sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}2x_{i,\nu}$ are exactly the entries of $2X\frac{\partial f(X^TX)}{\partial (X^TX)}$, and the proof is complete. $\blacksquare$
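The final entrywise formula can be checked mechanically against its matrix form. In this sketch, $f(Y)=\sum_{i,j}\sin y_{i,j}$ is an assumed test function, so the gradient matrix $D=[\partial f/\partial y_{\mu,\nu}]$ is $\cos Y$ entrywise:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
X = rng.standard_normal((m, n))
Y = X.T @ X

# Entrywise gradient of the assumed test function f(Y) = sum(sin(Y)):
D = np.cos(Y)          # D[mu, nu] = df/dy_{mu,nu}; symmetric since Y is

# Last line of the derivation: grad[i, j] = sum_nu df/dy_{j,nu} * 2 x_{i,nu}
entrywise = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        entrywise[i, j] = sum(D[j, nu] * 2 * X[i, nu] for nu in range(n))

# Matrix form claimed in the answer: 2 X D (entries agree since D = D^T)
assert np.allclose(entrywise, 2 * X @ D)
```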


$ \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} $The matrix inner product (denoted by a colon) is extremely useful in these sorts of problems.
It is essentially a product notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \frob{A}^2 \qquad \{ {\rm Frobenius\;norm} \} \\ }$$ This is also called the double-dot, or Frobenius product.
When applied to vectors $(n={\tt1})$ it reduces to an ordinary dot product.
Terms in a Frobenius product can be rearranged in many useful ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A \;=\; \LR{A^TC}:B \\ }$$ Define $\:Y=X^TX\:$ and assume that we know the gradient $\:G=\large{\grad fY}$.

Use $G$ to calculate the differential of the function, then perform a change of variables from $Y\to X$, and finally recover the gradient with respect to $X$: $$\eqalign{ df &= G:dY \\ &= G:\LR{dX^TX+X^TdX} \\ &= 2\,G:\LR{X^TdX} \\ &= \LR{2XG}:dX \\ \grad fX &= 2XG \\ }$$ The third line uses $G:\LR{dX^TX} = G^T:\LR{X^TdX} = G:\LR{X^TdX}$, which holds because $Y$ is symmetric and therefore so is $G$.
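The rearrangement rules and the final gradient-recovery step can be verified numerically. This is a sketch: `frob` is a helper implementing the colon product, and $G$ is a random symmetric stand-in for $\partial f/\partial Y$:

```python
import numpy as np

def frob(A, B):
    """Frobenius (double-dot) product  A:B = Tr(A^T B)."""
    return np.sum(A * B)

rng = np.random.default_rng(3)
p, q, r = 4, 3, 5
A = rng.standard_normal((p, q))
B = rng.standard_normal((q, r))
C = rng.standard_normal((p, r))

# Rearrangement rules quoted in the answer
assert np.isclose(frob(C, A @ B), frob(C @ B.T, A))
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B))

# Gradient-recovery step: with G symmetric (G = df/dY, Y = X^T X),
# G:(dX^T X + X^T dX) = (2 X G):dX for every direction dX.
m, n = 4, 3
X = rng.standard_normal((m, n))
dX = rng.standard_normal((m, n))
S = rng.standard_normal((n, n))
G = S + S.T            # random symmetric stand-in for df/dY
lhs = frob(G, dX.T @ X + X.T @ dX)
rhs = frob(2 * X @ G, dX)
assert np.isclose(lhs, rhs)
```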