In class, we called a real-valued function from the space of matrices to the reals $f: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}$ differentiable at $\mathbf{X}$ if:
$$\lim_{\mathbf{H} \to \mathbf{0_{m \times n}}} \frac{\lvert\lvert f(\mathbf{X} + \mathbf{H}) - f(\mathbf{X}) - tr([\nabla f(\mathbf{X})^T]\mathbf{H})\rvert \rvert}{\lvert\lvert \mathbf{H}\rvert\rvert} = 0$$
where the gradient is the transpose of the total derivative. In this definition of differentiability, I'm trying to understand the intuition behind using the $tr(\cdot)$.
- In this case, is it true that the total derivative $\mathscr{D}f$ is a map $\mathscr{D}f : \mathbb{R}^{m \times n} \rightarrow \mathscr{L}(\mathbb{R}^{m \times n},\mathbb{R})$?
- If the above is true, then $tr([\mathscr{D}f](\mathbf{H})) = [\mathscr{D}f](\mathbf{H})$ since $[\mathscr{D}f](\mathbf{H})$ is a real number. So do we take the trace just for convenience in manipulating the algebraic expressions of the matrices? Could we use $det(\cdot)$ instead?
Thanks.
Short answer: The function $(\mathbf{X},\mathbf{Y})\mapsto \operatorname{trace}(\mathbf{X}^T \mathbf{Y})$ defines an inner product on the space $\mathbb{R}^{m\times n}$.
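To make this concrete: $\operatorname{trace}(\mathbf{X}^T\mathbf{Y}) = \sum_{i,j} X_{ij}Y_{ij}$, which is just the ordinary dot product of the two matrices read as vectors of length $mn$. Here is a quick numerical sanity check, only as a sketch (the helper name `trace_inner` and the random test matrices are my own choices for illustration, and NumPy is assumed to be available):

```python
import numpy as np

def trace_inner(X, Y):
    """Trace inner product <X, Y> = trace(X^T Y) on m-by-n real matrices."""
    return np.trace(X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((3, 4))

# Same value as the entrywise sum of products (dot product of the flattened matrices).
print(np.isclose(trace_inner(X, Y), np.sum(X * Y)))

# Symmetry and positivity, two of the inner-product axioms.
print(np.isclose(trace_inner(X, Y), trace_inner(Y, X)))
print(trace_inner(X, X) > 0)
```

Note that the matrices need not be square: $\mathbf{X}^T\mathbf{Y}$ is always $n \times n$, so its trace is defined for any $m$ and $n$.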
Long answer: In the general setting of a finite-dimensional space $E$, the usual definition of the Fréchet derivative of a function $f: E\to \mathbb R$ at a point $x$ is the linear map $L\in \mathcal L(E,\mathbb R)$ such that $$\lim_{{h} \to 0} \frac{\lvert\lvert f(x+h) - f(x) - L(h)\rvert \rvert}{\lvert\lvert h\rvert\rvert} = 0$$
If you equip $E$ with an inner product $\langle \cdot,\cdot \rangle$, the Riesz representation theorem tells you that there is a unique vector $u\in E$ such that $L(y)=\langle y,u \rangle$ for all $y \in E$. The vector $u$ is usually called the "gradient of $f$ at $x$"; this is what you denote by $\nabla f(\mathbf{X})$. Now note that the function $(\mathbf{X},\mathbf{Y})\mapsto \operatorname{trace}(\mathbf{X}^T \mathbf{Y})$ defines an inner product on the space $\mathbb{R}^{m\times n}$, so with $E = \mathbb{R}^{m\times n}$ the linear map is $L(\mathbf{H}) = \langle \mathbf{H}, \nabla f(\mathbf{X}) \rangle = \operatorname{trace}(\nabla f(\mathbf{X})^T \mathbf{H})$, which is exactly the term in your definition.
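As an illustration, here is a small numerical check of the definition. The example function $f(\mathbf{X}) = \sum_{i,j} X_{ij}^2$ is my own choice, not something from your course; its gradient with respect to the trace inner product is $2\mathbf{X}$, and the remainder is $f(\mathbf{X}+\mathbf{H}) - f(\mathbf{X}) - \operatorname{trace}((2\mathbf{X})^T\mathbf{H}) = \lvert\lvert\mathbf{H}\rvert\rvert^2$, so the ratio in the limit should shrink like $\lvert\lvert\mathbf{H}\rvert\rvert$. A sketch, assuming NumPy (the names `f`, `grad_f` and `remainder_ratio` are hypothetical helpers):

```python
import numpy as np

def f(X):
    # Example function (an assumption for illustration): squared Frobenius norm.
    return np.sum(X * X)

def grad_f(X):
    # Gradient of f with respect to the trace inner product.
    return 2.0 * X

def remainder_ratio(X, H):
    """|f(X+H) - f(X) - trace(grad_f(X)^T H)| / ||H||, with the Frobenius norm."""
    linear_part = np.trace(grad_f(X).T @ H)
    return abs(f(X + H) - f(X) - linear_part) / np.linalg.norm(H)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))
D = rng.standard_normal((3, 4))  # fixed direction, scaled down below

# The ratio should decrease as H = t * D shrinks toward the zero matrix.
ratios = [remainder_ratio(X, t * D) for t in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

For this particular $f$ the ratio equals $\lvert\lvert\mathbf{H}\rvert\rvert$ up to rounding, so each tenfold reduction of $t$ should reduce it about tenfold.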