Thanks to the community, things have become relatively clear regarding the question: what is the gradient of a function $f(X)$ when $X$ is a symmetric matrix?
In particular, I collect here some useful links that have addressed this question in the past and can serve as references for the discussion below:
- Understanding notation of derivatives of a matrix
- Taylor expansion of a function of a symmetric matrix
- https://arxiv.org/pdf/1911.06491.pdf
In a nutshell, we can say that, for a function with a matrix argument, we have to distinguish between two "different", but related, gradients:
- the unconstrained gradient $G$, computed with standard matrix calculus by treating all entries of $X$ as independent, and used to compute the differential of the function, i.e. $df = G:dX$
- the constrained gradient $S$, which takes into account only the independent entries of the symmetric matrix $X$.
These two gradients are related by the expression:
$$S=G+G^{T}-I \circ G $$
and it turns out that the first-order differential of $f$ at a given point $X$ under a perturbation $dX$ can be computed as:
$$ d f=\sum_{i, j} G_{i j} d X_{i j} = \sum_{i \geq j} S_{i j} d X_{i j}$$
Note, however, that in an iterative algorithm that updates a variable $X^{k+1}$ (such as gradient descent), we have to use the constrained gradient $S$ rather than $G$, because $X$ is symmetric while $G$ may not be.
More information can be found in the links above, which also explain this relation in terms of the $\operatorname{vec}(\cdot)$ and $\operatorname{vech}(\cdot)$ operators.
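The relation between $G$ and $S$ is easy to verify numerically. Here is a minimal sketch (NumPy, with $f(X)=\log\det X$ so that $G=X^{-1}$; the tolerances are my own choices, not part of the theory):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)          # symmetric positive-definite

# For f(X) = log det X the unconstrained gradient is G = X^{-T} = X^{-1}.
G = np.linalg.inv(X)
# Constrained gradient S = G + G^T - I o G.
S = G + G.T - np.eye(n) * G

# Small symmetric perturbation dX.
B = 1e-6 * rng.standard_normal((n, n))
dX = B + B.T

df_G = np.sum(G * dX)                # sum over all (i, j) with G
df_S = np.sum(np.tril(S * dX))       # sum over i >= j only, with S
df_true = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]

assert np.isclose(df_G, df_S)                 # the two sums agree
assert np.isclose(df_G, df_true, atol=1e-9)   # and match the actual change in f
```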
Coming to my question: I now want to find the Hessian of the function $f(X)$, which in theory is a $4$th-order tensor, and we already know what a winding road it was to arrive at the gradient.
To start, is it correct to perturb the first-order differential (with the unconstrained gradient)? If so, I reach a scalar quadratic form. For instance, if we consider the function $f(X)=\log \operatorname{det} X$, we know that the second-order term, with perturbations $U$ and $V$, is given by (see the question Second order approximation of log det X):
$$-\operatorname{tr}\left(X^{-1} U X^{-1} V\right) = - \operatorname{vec}(U^{\top})^{\top}(X^{-\top} \otimes X^{-1}) \operatorname{vec}(V)$$
From this we can read off the Hessian in matrix form as $X^{-\top} \otimes X^{-1}$.
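As a sanity check, the equality between the trace form and the Kronecker form can be verified numerically. A small NumPy sketch (the column-major `vec` is the convention under which the usual Kronecker identities hold):

```python
import numpy as np

n = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)          # symmetric positive-definite
U = rng.standard_normal((n, n))
V = rng.standard_normal((n, n))

Xinv = np.linalg.inv(X)
vec = lambda M: M.reshape(-1, order="F")   # column-major vectorization

# Left-hand side: -tr(X^{-1} U X^{-1} V)
lhs = -np.trace(Xinv @ U @ Xinv @ V)
# Right-hand side: -vec(U^T)^T (X^{-T} kron X^{-1}) vec(V)
rhs = -vec(U.T) @ np.kron(Xinv.T, Xinv) @ vec(V)

assert np.isclose(lhs, rhs)
```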
My first question is: how can this be written in tensor form?
My second question is: how do we obtain the constrained Hessian in this case?
If a matrix is symmetric, then by definition it contains redundant information. This redundancy can be eliminated by using the half-vec representation, i.e. $$\eqalign{ x &= {\rm vech}(X) \;\iff\; X = {\rm unvech}(x) \\ }$$ It may be necessary to temporarily reconstitute the matrix in order to evaluate matrix-specific functions (like the inverse, trace or determinant), but otherwise any iterative process (like gradient descent or quasi-Newton) should be performed within the half-vec space.
There are two advantages to the half-vec representation. The first is that $x$ is unconstrained, so there is no need to enforce a symmetry constraint between steps. The second is that $x$ is a vector, so it is not necessary to invoke 4th-order tensors in order to calculate the Hessian; such tensors are only required when the independent variable is a matrix.
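As a concrete illustration of the half-vec space, here is a minimal sketch; the `vech`/`unvech` helpers are hand-rolled (they are not NumPy built-ins) and use column-major, lower-triangle ordering:

```python
import numpy as np

n = 3
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n))
X = A + A.T                          # symmetric

def vech(M):
    """Half-vec: stack the lower triangle of M column by column."""
    return np.concatenate([M[j:, j] for j in range(M.shape[1])])

def unvech(x, n):
    """Rebuild the symmetric matrix from its half-vec."""
    Xm = np.zeros((n, n))
    k = 0
    for j in range(n):
        for i in range(j, n):
            Xm[i, j] = Xm[j, i] = x[k]
            k += 1
    return Xm

x = vech(X)
assert x.size == n * (n + 1) // 2    # only the independent entries survive
assert np.allclose(unvech(x, n), X)  # lossless round trip
```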
For the example function
$$\eqalign{ \phi &= \log\det{\rm unvech}(x) \\ &= \log\det X \\ d\phi &= X^{-1}:dX \\ &= {\rm vec}\big(X^{-1}\big):{\rm vec}(dX) \\ &= {\rm vec}\big(X^{-1}\big):D\,dx \qquad&\big({\rm Duplication\,matrix}\big)\\ &= D^T{\rm vec}\big(X^{-1}\big):dx \\ \frac{\partial\phi}{\partial x} &= D^T{\rm vec}\big(X^{-1}\big) \;\doteq\; g \qquad&\big({\rm half\,vec\,gradient}\big)\\ }$$ The Hessian matrix can be calculated as $$\eqalign{ dg &= D^T{\rm vec}\big(dX^{-1}\big) \\ &= -D^T{\rm vec}\big(X^{-1}\,dX\,X^{-1}\big) \\ &= -D^T\big(X^{-1}\otimes X^{-1}\big)\,{\rm vec}(dX) \\ &= -D^T\big(X\otimes X\big)^{-1}D\,dx \\ \frac{\partial g}{\partial x} &= -D^T\big(X\otimes X\big)^{-1}D \;\doteq\; H \qquad&\big({\rm half\,vec\,Hessian}\big)\\ }$$ In this form it is obvious that $H^T=H,\,$ as it should.
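The half-vec gradient and Hessian above can be checked numerically. The following sketch builds the duplication matrix by hand (column-major convention) and compares $g$ against central finite differences:

```python
import numpy as np

n = 3
rng = np.random.default_rng(3)
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)          # symmetric positive-definite

vec = lambda M: M.reshape(-1, order="F")

def duplication_matrix(n):
    """D such that vec(X) = D @ vech(X), with vech stacking the
    lower triangle of X column by column."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    k = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, k] = 1      # position of X[i, j] in vec(X)
            D[i * n + j, k] = 1      # position of X[j, i] in vec(X)
            k += 1
    return D

def vech(M):
    return np.concatenate([M[j:, j] for j in range(M.shape[1])])

def unvech(x, n):
    Xm = np.zeros((n, n))
    k = 0
    for j in range(n):
        for i in range(j, n):
            Xm[i, j] = Xm[j, i] = x[k]
            k += 1
    return Xm

D = duplication_matrix(n)
g = D.T @ vec(np.linalg.inv(X))                 # half-vec gradient
H = -D.T @ np.linalg.inv(np.kron(X, X)) @ D     # half-vec Hessian

# Check g against finite differences of phi(x) = log det unvech(x).
x0, eps = vech(X), 1e-6
phi = lambda x: np.linalg.slogdet(unvech(x, n))[1]
g_fd = np.array([(phi(x0 + eps * e) - phi(x0 - eps * e)) / (2 * eps)
                 for e in np.eye(x0.size)])
assert np.allclose(g, g_fd, atol=1e-5)
assert np.allclose(H, H.T)                      # H^T = H, as it should be
```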
Then an iteration step would look like $$\eqalign{ x_{k+1} &= x_k - \lambda_k g_k \qquad&\big({\rm gradient\,step}\big)\\ x_{k+1} &= x_k - \lambda_k H_k^{-1} g_k \qquad&\big({\rm Newton\,step}\big) \\ }$$ Once the solution vector $x_\infty$ is calculated, it can be put in matrix form $$X_\infty = {\rm unvech}(x_\infty)$$ If you insist on carrying out iterations in the symmetric matrix space, the Newton step is hopeless but the gradient step can be salvaged by applying the ${\rm unvech}$ operation to each term $$\eqalign{ X_{k+1} &= X_k - \lambda_k\big(2X_k^{-1}-I\circ X_k^{-1}\big) \\\\ }$$ Although nothing can be done for the Newton step in the matrix space, there is an important simplification in the half-vec space $$\eqalign{ H_k^{-1} &= -D^+(X_k\otimes X_k){D^+}^{T} \\ }$$ where $D^+$ denotes the pseudoinverse of the duplication matrix, which is constant for all steps. So there is no need to calculate matrix inverses at each step, beyond the one required to calculate $g$ itself.
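The claim that the matrix-space gradient step is just the ${\rm unvech}$ of the half-vec step can be verified directly; a small sketch with hand-rolled helpers (column-major convention):

```python
import numpy as np

n = 3
rng = np.random.default_rng(4)
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)          # symmetric positive-definite
lam = 0.1                            # step size (arbitrary choice)

vec = lambda M: M.reshape(-1, order="F")

def duplication_matrix(n):
    """vec(X) = D @ vech(X) for symmetric X (column-major ordering)."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    k = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, k] = 1
            D[i * n + j, k] = 1
            k += 1
    return D

def vech(M):
    return np.concatenate([M[j:, j] for j in range(M.shape[1])])

def unvech(x, n):
    Xm = np.zeros((n, n))
    k = 0
    for j in range(n):
        for i in range(j, n):
            Xm[i, j] = Xm[j, i] = x[k]
            k += 1
    return Xm

D = duplication_matrix(n)
G = np.linalg.inv(X)
g = D.T @ vec(G)                     # half-vec gradient of log det

# One gradient step in half-vec space, mapped back to matrix space ...
X_half = unvech(vech(X) - lam * g, n)
# ... agrees with the matrix-space step using the constrained gradient.
X_mat = X - lam * (2 * G - np.eye(n) * G)
assert np.allclose(X_half, X_mat)
```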
In fact, you don't need to calculate a pseudoinverse either, since $D^+$ is equal to $D^T$ but with its rows normalized so as to make them stochastic, i.e. to sum to ${\tt1}$.
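Both facts, the cheap construction of $D^+$ and the inverse-Hessian shortcut, can be checked numerically; note that since $H=-D^T(X\otimes X)^{-1}D$, the shortcut carries a minus sign:

```python
import numpy as np

def duplication_matrix(n):
    """vec(X) = D @ vech(X) for symmetric X (column-major ordering)."""
    D = np.zeros((n * n, n * (n + 1) // 2))
    k = 0
    for j in range(n):
        for i in range(j, n):
            D[j * n + i, k] = 1
            D[i * n + j, k] = 1
            k += 1
    return D

n = 4
D = duplication_matrix(n)

# D^+ equals D^T with each row normalized to sum to 1 (stochastic rows).
Dplus = np.linalg.pinv(D)
Dplus_cheap = D.T / D.T.sum(axis=1, keepdims=True)
assert np.allclose(Dplus, Dplus_cheap)

# Inverse-Hessian shortcut: H = -D^T (X kron X)^{-1} D, hence
# H^{-1} = -D^+ (X kron X) D^{+T}, with no fresh inverse needed per step.
rng = np.random.default_rng(5)
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)          # symmetric positive-definite
H = -D.T @ np.linalg.inv(np.kron(X, X)) @ D
assert np.allclose(np.linalg.inv(H), -Dplus @ np.kron(X, X) @ Dplus.T)
```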
NB: In the steps above, a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B = {\rm Tr}(A^TB) = {\rm Tr}(AB^T) }$$ Also, several steps take advantage of the fact that $X$ is symmetric.
Update
There are some functions which losslessly shuffle and/or reshape the elements of a matrix without changing their values. Examples include: vec, transpose, blockvec, vecpose, and their inverses.
Let $S(X)$ denote one of these shuffling functions. $\,S$ exhibits some interesting properties with respect to addition, subtraction, and Hadamard / Frobenius multiplication. $$\eqalign{ S(X_1) \pm S(X_2) &= S(X_1 \pm X_2) \\ S(X_1) \circ S(X_2) &= S(X_1 \circ X_2) \\ S(X_1):S(X_2) &= X_1:X_2 \\ }$$ In particular, the subtraction property means that $$\eqalign{ dS(X) &= S(X+dX) - S(X) \;=\; S(dX) \\ }$$ These properties were used implicitly in some of the steps of the derivation above, specifically for the vec/unvec functions.
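These shuffle properties are straightforward to confirm for the vec and transpose cases; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.standard_normal((3, 4))
X2 = rng.standard_normal((3, 4))

vec = lambda M: M.reshape(-1, order="F")   # column-major vectorization
frob = lambda A, B: np.sum(A * B)          # Frobenius product A:B

for S in (vec, np.transpose):              # two of the shuffling functions
    assert np.allclose(S(X1) + S(X2), S(X1 + X2))          # addition
    assert np.allclose(S(X1) - S(X2), S(X1 - X2))          # subtraction
    assert np.allclose(S(X1) * S(X2), S(X1 * X2))          # Hadamard product
    assert np.isclose(frob(S(X1), S(X2)), frob(X1, X2))    # Frobenius product
```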