The task is to prove that for any matrix $X$ and differentiable scalar function $f$, the following holds:
$$ \frac{\partial f(X^\top X)}{\partial X} = 2{X}\frac{\partial f(X^\top X)}{\partial (X^\top X)}. $$
Following the chain rule, I can already write:
$$ \frac{\partial f(X^\top X)}{\partial X} = \frac{\partial f(u)}{\partial X} = \frac{\partial f(u)}{\partial u}\frac{\partial u}{\partial X} = \frac{\partial f(X^\top X)}{\partial (X^\top X)}\frac{\partial X^\top X}{\partial X} $$
where $u$ is the function of $X$ given by $u(X) = X^{\top}X$.
Now, I have difficulty proving that $\dfrac{\partial (X^{\top}X)}{\partial X} = 2X$.
I know from this previous question as well as The Matrix Cookbook that, for a vector $x$, the following holds for the squared Euclidean norm: $\dfrac{\partial \|x\|^2_2}{\partial x} = \dfrac{\partial (x^{\top} x)}{\partial x} = 2x$. But how can I go from this to rigorously arguing the same for $\dfrac{\partial f(X^{\top} X)}{\partial X}$? Or is there an even simpler way to do it?
Any hints would be appreciated! Thank you very much.
I'm not sure what the notation $\frac{\partial \alpha (X)}{\partial X}$ means, but you can write $f(X^\top X)=(f\circ g\circ h)(X)$ for $h(X):=(X^\top ,X)$ and $g(X,Y):=XY$. Then, using the Fréchet derivative, the chain rule gives for every matrix $H$
$$ \begin{align*} \partial [f(X^\top X)]H&=(\partial f\circ g\circ h)(X)(\partial g\circ h)(X)\partial h(X)H\\ &=\partial f(X^\top X)\partial g(X^\top ,X)\partial h(X)H\\ &=\partial f(X^\top X)\partial g(X^\top ,X)(H^\top ,H)\\ &=\partial f(X^\top X)(X^\top H+H^\top X). \end{align*} $$
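To connect this with the formula in the question, one possible reading (an assumption on my part) is that $\frac{\partial f}{\partial (X^\top X)}$ denotes the matrix $G$ for which $\partial f(S)K = \operatorname{tr}(G^\top K)$, with the entries of $S$ treated as independent. Under that reading, $\operatorname{tr}\big(G^\top (X^\top H + H^\top X)\big) = \operatorname{tr}\big((X(G+G^\top))^\top H\big)$, so the gradient with respect to $X$ would be $X(G+G^\top)$, which reduces to $2XG$ only when $G$ is symmetric. Below is a small numerical sanity check of that claim, not a proof. The test function $f(S)=\operatorname{tr}(AS)$, the matrix `A`, and the helper `numerical_gradient` are illustrative choices that do not appear in the question.

```python
import numpy as np

def numerical_gradient(func, X, eps=1e-6):
    """Central-difference gradient of a scalar function of a matrix."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            grad[i, j] = (func(X + E) - func(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
A = rng.standard_normal((3, 3))  # deliberately NOT symmetric

# Illustrative assumption: f(S) = tr(A S), whose gradient w.r.t. S is G = A^T
# when the entries of S are treated as independent variables.
f = lambda S: np.trace(A @ S)
G = A.T

# Gradient of X -> f(X^T X), estimated by finite differences.
grad_numeric = numerical_gradient(lambda M: f(M.T @ M), X)

grad_general = X @ (G + G.T)     # candidate formula for arbitrary G
grad_symmetric_only = 2 * X @ G  # formula from the question

print(np.allclose(grad_numeric, grad_general, atol=1e-5))
print(np.allclose(grad_numeric, grad_symmetric_only, atol=1e-5))
```

If `A` is replaced by a symmetric matrix, for example `A + A.T`, the two candidate formulas coincide, which matches the factor $2X$ in the identity being asked about.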
Hope it helps.