I am trying to minimize this cost function using the Gauss-Newton method: $$f = \sum_{i = 1}^n \mathrm{tr}\left(Z_i^TZ_i\right)$$ where each $Z_i$ is a $4\times4$ matrix and a function of a real vector $\vec{a}\in\mathbb{R}^5$. Since I am using the Gauss-Newton approach, I define $g \in\mathbb{R}^{n}$ componentwise by $g_i = \sqrt{\mathrm{tr}(Z_i^TZ_i)} = \sqrt{p_i}$, for $i = 1,2,\dots,n$.
Here is my attempt to find the gradient and the Hessian $H$. The gradient of the cost function is $2J^Tg$ and the Hessian is $2\left(J^TJ + \sum_i g_iH_i\right)$, where $J$ is the Jacobian of $g$ and $H_i$ is the Hessian of $g_i$.
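As a sanity check on these formulas, here is a minimal numerical sketch (the residual function below is purely illustrative, standing in for $g_i = \sqrt{\mathrm{tr}(Z_i^TZ_i)}$): it builds a finite-difference Jacobian, the gradient $2J^Tg$, and the Gauss-Newton Hessian approximation $2J^TJ$, which drops the $\sum_i g_iH_i$ term.

```python
import numpy as np

# Toy Gauss-Newton quantities for f(a) = sum_i g_i(a)^2:
#   grad f = 2 J^T g,   Gauss-Newton Hessian ≈ 2 J^T J
# (the exact Hessian would add the 2 * sum_i g_i H_i term).

def residuals(a):
    # hypothetical residual vector g(a); stands in for sqrt(tr(Z_i^T Z_i))
    return np.array([a[0]**2 + a[1], a[0] - 2.0*a[1], a[1]**3 + 1.0])

def jacobian_fd(g, a, h=1e-6):
    # forward-difference Jacobian: J[i, j] = d g_i / d a_j
    g0 = g(a)
    J = np.empty((g0.size, a.size))
    for j in range(a.size):
        ap = a.copy()
        ap[j] += h
        J[:, j] = (g(ap) - g0) / h
    return J

a = np.array([0.5, -1.0])
g = residuals(a)
J = jacobian_fd(residuals, a)
grad = 2.0 * J.T @ g    # gradient of f = g^T g
H_gn = 2.0 * J.T @ J    # Gauss-Newton Hessian approximation
```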
$$\left.\frac{\partial g_i}{\partial\epsilon}\right|_{\epsilon = 0} = \left.\frac{\partial g_i}{\partial p_i}\frac{\partial p_i}{\partial Z_i}\frac{\partial Z_i}{\partial\epsilon}\right|_{\epsilon = 0}$$ $$\left.\frac{\partial g_i}{\partial\epsilon}\right|_{\epsilon = 0} = \left.\frac{1}{2\sqrt{p_i}}\,Z_i(\epsilon\vec{a})\,\frac{\partial Z_i(\epsilon\vec{a})}{\partial\epsilon}\right|_{\epsilon = 0}$$
Apparently, the expression above is a $4\times4$ matrix. How do I proceed to find the gradient of the cost function, given that $J$ and $g$ cannot be multiplied right away? I tried the same for the Hessian and it is even worse. I am not sure whether I am on the right path, so I have not shown my Hessian attempt here.
Your latest formulation of the cost function can be greatly simplified $$ \eqalign { f &= \sum_k {\rm tr}\bigg((A_kX)^T(A_kX)\bigg) \cr &= \sum_k (A_kX):(A_kX) \cr &= \sum_k (A_k^TA_k):(XX^T) \cr &= M:(XX^T) \cr } $$ by summing the $A_k$ once and for all, and removing them from the problem. I also replaced the trace function with the Frobenius product $A:B = {\rm tr}(A^TB)$ for algebraic convenience.
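The simplification is easy to verify numerically. A minimal sketch (random matrices, illustrative dimensions), where the Frobenius product $A:B$ is computed as an elementwise product followed by a sum:

```python
import numpy as np

# Check: sum_k tr((A_k X)^T (A_k X)) == M : (X X^T),  M = sum_k A_k^T A_k,
# where A:B is the Frobenius product tr(A^T B) = sum(A * B).

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4, 4))   # seven random 4x4 matrices A_k
X = rng.standard_normal((4, 4))

lhs = sum(np.trace((Ak @ X).T @ (Ak @ X)) for Ak in A)
M = sum(Ak.T @ Ak for Ak in A)       # summed once and for all
rhs = np.sum(M * (X @ X.T))          # Frobenius product M : (X X^T)
```

Once $M$ is formed, each cost evaluation no longer touches the individual $A_k$ at all.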
Next, instead of jumping immediately to the chain rule, let's find the differential of $f$. $$ \eqalign { df &= M:d(XX^T) \cr &= M:(dX\,X^T + X\,dX^T) \cr &= (M+M^T):dX\,X^T \cr &= (M+M^T)X:dX \cr &= 2\,MX:dX \cr } $$ where the last step uses the fact that $M=\sum_k A_k^TA_k$ is symmetric. You haven't provided enough information about $X=X(a)$ for me to evaluate the derivative, but it will look like this $$ \eqalign { \frac {\partial f} {\partial a^T} &= 2\,MX : \bigg(\frac {\partial X} {\partial a^T}\bigg) \cr &= 2\,{\rm vec}(MX)^T\,\,\frac {\partial {\rm vec}(X)} {\partial a^T} \cr } $$
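To make the final formula concrete, here is a sketch under an assumed parametrization (the post doesn't specify $X(a)$, so I take a hypothetical linear one, $X(a) = X_0 + \sum_j a_jB_j$, for which $\partial{\rm vec}(X)/\partial a^T$ has columns ${\rm vec}(B_j)$), checked against finite differences:

```python
import numpy as np

# Check df/da = 2 vec(MX)^T (d vec(X)/da^T) for a hypothetical
# linear parametrization X(a) = X0 + sum_j a_j B_j.

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
M = M + M.T                          # symmetric, like sum_k A_k^T A_k
X0 = rng.standard_normal((4, 4))
B = rng.standard_normal((5, 4, 4))   # dX/da_j = B_j

def Xfun(a):
    return X0 + np.tensordot(a, B, axes=1)

def f(a):
    X = Xfun(a)
    return np.sum(M * (X @ X.T))     # f = M : (X X^T)

a = rng.standard_normal(5)
X = Xfun(a)
# d vec(X)/da^T: column j is vec(B_j); Fortran order = column-stacking vec
dvecX = np.column_stack([Bj.ravel(order="F") for Bj in B])
grad = 2.0 * (M @ X).ravel(order="F") @ dvecX
```

Any convention for ${\rm vec}$ works, as long as $MX$ and $\partial X/\partial a^T$ are vectorized the same way.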