I am new to matrix calculus and I hope you can help me with this question!
I’m trying to derive the update law for gradient descent to minimize a cost function $$j(\mathbf{X}) = \frac{1}{2} \mathbf{e}^T(\mathbf{X}) \mathbf{e}(\mathbf{X})$$ where $s \in \mathbb{R}$, $\mathbf{b}, \mathbf{e} \in \mathbb{R}^n$, and $\mathbf{A}, \mathbf{X} \in \mathbb{R}^{n\times m}$. First, I calculate the gradient of $j$ w.r.t. a vectorized $\mathbf{x} := \operatorname{vec} \mathbf{X} \in \mathbb{R}^{nm}$ $$\nabla_\mathbf{x} j(\mathbf{X}) = \left(\frac{\partial j}{\partial \mathbf{x}}\right)^T = \left(\frac{\partial \mathbf{e}(\mathbf{X})}{\partial \mathbf{x}}\right)^T \mathbf{e}(\mathbf{X}).$$ For $\mathbf{e}$ it is known that $$\frac{\mathrm{d}\mathbf{e}(\mathbf{X}(s), s)}{\mathrm{d}s} = \mathbf{A} \mathbf{X}^T(s) \mathbf{b}.$$ Now I try to somehow derive the gradient using the chain rule $$\nabla_\mathbf{x} j(\mathbf{X}) = \left(\frac{\mathrm{d} \mathbf{e}(\mathbf{X}(s), s)}{\mathrm{d}s} \frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \mathbf{e} = \left(\mathbf{A} \mathbf{X}^T(s) \mathbf{b} \frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \mathbf{e} = \left(\frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \operatorname{Tr}\left(\mathbf{X}^T(s) \mathbf{b} \mathbf{e}^T \mathbf{A}\right) = \left(\frac{\mathrm{d} s}{\mathrm{d} \mathbf{x}(s)}\right)^T \mathbf{x}^T(s) \operatorname{vec}\left(\mathbf{b} \mathbf{e}^T \mathbf{A}\right) =\ ...?$$ The update law for $\mathbf{X}$ is stated as $$\frac{\mathrm{d} \mathbf{X}}{\mathrm{d} s} = - \mathbf{\Gamma} \mathbf{b} \mathbf{e}^T \mathbf{A}$$ where $\mathbf{\Gamma} \in \mathbb{R}^{n \times n}$ is a diagonal matrix of update rates.
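For what it’s worth, the vec/trace identity $\operatorname{Tr}(\mathbf{X}^T \mathbf{M}) = (\operatorname{vec}\mathbf{X})^T \operatorname{vec}\mathbf{M}$ that I use in the last step checks out numerically (quick NumPy sketch; I’m assuming the usual column-major convention for $\operatorname{vec}$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal((n, m))
M = rng.standard_normal((n, m))  # stand-in for b e^T A

# vec(.) stacks the columns of a matrix: order='F' is column-major flattening
vec = lambda Z: Z.flatten(order='F')

# Identity: Tr(X^T M) = vec(X)^T vec(M)
lhs = np.trace(X.T @ M)
rhs = vec(X) @ vec(M)
assert np.isclose(lhs, rhs)
```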
I don’t understand how the update law (the last equation) can be derived from the gradient (the second-to-last equation). Could anyone please help me? Thank you so much in advance!
$\def\B{\Big}\def\L{\left}\def\R{\right}\def\p#1#2{\frac{\partial #1}{\partial #2}}\def\P#1#2{\frac{d #1}{d #2}}$Using the given expression for the error function, calculate its differential $$\eqalign{ e &= e_0 + A\L(\small\int_{X_0}^{X} dX\R)^Tb \quad\implies\quad \color{red}{de = A\,dX^Tb} \\ }$$ Calculate the differential of the cost function and its gradient. $$\eqalign{ j &= \tfrac 12e:e \\ dj &= e:\color{red}{de} \\ &= e:\color{red}{A\,dX^Tb} \\ &= e^T:b^TdX\,A^T \\ &= be^TA:dX \\ \p{j}{X} &= be^TA &\quad\big({\rm gradient}\big) \\ }$$ Now use an iterative gradient-descent method to calculate the optimal $X$ $$\eqalign{ X_{k+1} &= X_k - \lambda_k\Big(be^T_kA\Big) \\ }$$ where the scalar $\lambda_k>0$ is the step-length which minimizes the cost function for the $k^{th}$ iteration. The initial value (at $k=0$) of $X$ is likely $\,X_0=X(s_0),\,$ or it could be a random matrix.
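As a sanity check on the gradient $\p{j}{X}=be^TA$, here is a small NumPy sketch that compares it against central finite differences. The affine error model $e(X)=e_0+AX^Tb$ (with an assumed constant $e_0$) is just one function consistent with the red differential $de=A\,dX^Tb$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 3
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
e0 = rng.standard_normal(n)   # assumed constant offset
X = rng.standard_normal((n, m))

# Affine error model consistent with de = A dX^T b
e = lambda X: e0 + A @ (X.T @ b)
j = lambda X: 0.5 * e(X) @ e(X)

# Analytic gradient from the derivation above: dj/dX = b e^T A
grad = np.outer(b, e(X)) @ A

# Central finite-difference approximation of dj/dX, entry by entry
eps = 1e-6
num = np.zeros((n, m))
for i in range(n):
    for k in range(m):
        dX = np.zeros((n, m))
        dX[i, k] = eps
        num[i, k] = (j(X + dX) - j(X - dX)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```

A plain descent step $X_{k+1} = X_k - \lambda_k\,\texttt{grad}$ then implements the iteration above.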
Apparently, your textbook proposes to replace the standard scalar step-length with a matrix $$\lambda_kI \to \Gamma_k$$
In the preceding, a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B = {\rm Tr}(A^TB) \\ }$$ The properties of the trace function allow the terms in such a product to be rearranged in various ways, e.g. $$\eqalign{ A:B &= B:A = B^T:A^T \\ CA:B &= C:BA^T = A:C^TB \\ }$$
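These rearrangement rules are easy to confirm numerically. A short NumPy sketch (shapes chosen so every product is defined):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))
C = rng.standard_normal((4, 4))

# Frobenius product  A:B = Tr(A^T B)
frob = lambda P, Q: np.trace(P.T @ Q)

# A:B = B:A = B^T:A^T
assert np.isclose(frob(A, B), frob(B, A))
assert np.isclose(frob(A, B), frob(B.T, A.T))

# CA:B = C:BA^T = A:C^T B
assert np.isclose(frob(C @ A, B), frob(C, B @ A.T))
assert np.isclose(frob(C @ A, B), frob(A, C.T @ B))
```

The second pair of rules is exactly what moves $dX$ to one side in the step $dj = be^TA:dX$ above.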