I am new to matrix calculus and I hope you can help me with this question!
I’m trying to derive the update law for gradient descent to minimize a cost function $$j(\mathbf{X}) = \frac{1}{2} \mathbf{e}^T(\mathbf{X}) \mathbf{e}(\mathbf{X})$$ where $s \in \mathbb{R}$, $\mathbf{b}, \mathbf{e} \in \mathbb{R}^n$, and $\mathbf{A}, \mathbf{X} \in \mathbb{R}^{n\times m}$. First, I calculate the gradient of $j$ w.r.t. a vectorized $\mathbf{x} := \operatorname{vec} \mathbf{X} \in \mathbb{R}^{nm}$ $$\nabla_\mathbf{x} j(\mathbf{X}) = \left(\frac{\partial j}{\partial \mathbf{x}}\right)^T = \left(\frac{\partial \mathbf{e}(\mathbf{X})}{\partial \mathbf{x}}\right)^T \mathbf{e}(\mathbf{X}).$$ For $\mathbf{e}$ it is known that $$\frac{\mathrm{d}\mathbf{e}(\mathbf{X}(s), s)}{\mathrm{d}s} = \mathbf{A} \mathbf{X}^T(s) \mathbf{b}.$$ Now I try to somehow derive the gradient using the chain rule $$\nabla_\mathbf{x} j(\mathbf{X}) = \left(\frac{\mathrm{d} \mathbf{e}(\mathbf{X}(s), s)}{\mathrm{d}s} \frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \mathbf{e} = \left(\mathbf{A} \mathbf{X}^T(s) \mathbf{b} \frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \mathbf{e} = \left(\frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \operatorname{Tr}\left(\mathbf{X}^T(s) \mathbf{b} \mathbf{e}^T \mathbf{A}\right) = \left(\frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \mathbf{x}^T(s) \operatorname{vec}\left(\mathbf{b} \mathbf{e}^T \mathbf{A}\right) =\ ...?$$ The update law for $\mathbf{X}$ is stated as $$\frac{\mathrm{d} \mathbf{X}}{\mathrm{d} s} = - \mathbf{\Gamma} \mathbf{b} \mathbf{e}^T \mathbf{A}$$ where $\mathbf{\Gamma} \in \mathbb{R}^{n \times n}$ is a diagonal matrix of update rates.
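For what it’s worth, the vec/trace identity $\operatorname{Tr}(\mathbf{X}^T \mathbf{M}) = (\operatorname{vec}\mathbf{X})^T \operatorname{vec}\mathbf{M}$ that I use in the last step checks out numerically (quick NumPy sketch; I’m assuming the usual column-major convention for $\operatorname{vec}$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal((n, m))
M = rng.standard_normal((n, m))  # stand-in for b e^T A

# vec(.) stacks the columns of a matrix: order='F' is column-major flattening
vec = lambda Z: Z.flatten(order='F')

# Identity: Tr(X^T M) = vec(X)^T vec(M)
lhs = np.trace(X.T @ M)
rhs = vec(X) @ vec(M)
assert np.isclose(lhs, rhs)
```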
I don’t understand how the update law (the last equation) can be derived from the gradient (the second-to-last equation). Could anyone please help me? Thank you so much in advance!
$\def\B{\Big}\def\L{\left}\def\R{\right}\def\p#1#2{\frac{\partial #1}{\partial #2}}\def\P#1#2{\frac{d #1}{d #2}}$Using the given expression for the error function, calculate its differential $$\eqalign{ e &= e_0 + A\L(\small\int_{X_0}^{X} dX\R)^Tb \quad\implies\quad \color{red}{de = A\,dX^Tb} \\ }$$ Calculate the differential of the cost function and its gradient. $$\eqalign{ j &= \tfrac 12e:e \\ dj &= e:\color{red}{de} \\ &= e:\color{red}{A\,dX^Tb} \\ &= e^T:b^TdX\,A^T \\ &= be^TA:dX \\ \p{j}{X} &= be^TA &\quad\big({\rm gradient}\big) \\ }$$ Now use an iterative gradient-descent method to calculate the optimal $X$ $$\eqalign{ X_{k+1} &= X_k - \lambda_k\Big(be^T_kA\Big) \\ }$$ where the scalar $\lambda_k>0$ is the step-length which minimizes the cost function for the $k^{th}$ iteration. The initial value (at $k=0$) of $X$ is likely $\,X_0=X(s_0),\,$ or it could be a random matrix.
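As a sanity check on the gradient $\p{j}{X}=be^TA$, here is a small NumPy sketch that compares it against central finite differences. The affine error model $e(X)=e_0+AX^Tb$ (with an assumed constant $e_0$) is just one function consistent with the red differential $de=A\,dX^Tb$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
e0 = rng.standard_normal(n)   # assumed constant offset
X = rng.standard_normal((n, m))

# Affine error model consistent with de = A dX^T b
e = lambda X: e0 + A @ (X.T @ b)
j = lambda X: 0.5 * e(X) @ e(X)

# Analytic gradient from the derivation above: dj/dX = b e^T A
grad = np.outer(b, e(X)) @ A

# Central finite-difference approximation of dj/dX, entry by entry
eps = 1e-6
num = np.zeros((n, m))
for i in range(n):
    for k in range(m):
        dX = np.zeros((n, m))
        dX[i, k] = eps
        num[i, k] = (j(X + dX) - j(X - dX)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```

A plain descent step $X_{k+1} = X_k - \lambda_k\,\texttt{grad}$ then implements the iteration above.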
Apparently, your textbook proposes to replace the standard scalar step-length with a matrix $$\lambda_kI \to \Gamma_k$$
In the preceding, a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B = {\rm Tr}(A^TB) \\ }$$ The properties of the trace function allow the terms in such a product to be rearranged in various ways, e.g. $$\eqalign{ A:B &= B:A = B^T:A^T \\ CA:B &= C:BA^T = A:C^TB \\ }$$
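These rearrangement rules are easy to confirm numerically. A short NumPy sketch (shapes chosen so every product is defined):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((4, 4))

# Frobenius product  A:B = Tr(A^T B)
frob = lambda P, Q: np.trace(P.T @ Q)

# A:B = B:A = B^T:A^T
assert np.isclose(frob(A, B), frob(B, A))
assert np.isclose(frob(A, B), frob(B.T, A.T))

# CA:B = C:BA^T = A:C^T B
assert np.isclose(frob(C @ A, B), frob(C, B @ A.T))
assert np.isclose(frob(C @ A, B), frob(A, C.T @ B))
```

The second pair of rules is exactly what moves $dX$ to one side in the step $dj = be^TA:dX$ above.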