Let $X_1, \ldots, X_N \in \mathbb{R}^p$ and $Y_1, \ldots, Y_N \in \mathbb{R}$. Define $$ X=\left[\begin{array}{c} X_1^{\top} \\ \vdots \\ X_N^{\top} \end{array}\right] \in \mathbb{R}^{N \times p}, \quad Y=\left[\begin{array}{c} Y_1 \\ \vdots \\ Y_N \end{array}\right] \in \mathbb{R}^N $$
Let $$ \ell_i(\theta)=\frac{1}{2}\left(X_i^{\top} \theta-Y_i\right)^2 \quad \text { for } i=1, \ldots, N, \quad \mathcal{L}(\theta)=\frac{1}{2}\|X \theta-Y\|^2 . $$
Show (a) $\nabla_\theta \ell_i(\theta)=\left(X_i^{\top} \theta-Y_i\right) X_i$ and (b) $\nabla_\theta \mathcal{L}(\theta)=X^{\top}(X \theta-Y)$.
Hint. For part (a), start by computing $\frac{\partial}{\partial \theta_j} \ell_i(\theta)$. For part (b), use the fact that $$ M v=\sum_{i=1}^N M_{:, i} v_i \in \mathbb{R}^p $$ for any $M \in \mathbb{R}^{p \times N}, v \in \mathbb{R}^N$, where $M_{:, i}$ is the $i$ th column of $M$ for $i=1, \ldots, N$.
My Preliminary Progress: For part (a): I started with the definition of $\ell_i(\theta)$ and applied the chain rule to compute the gradient, arriving at: $$ \nabla_\theta \ell_i(\theta)=\left(X_i^{\top} \theta-Y_i\right) X_i $$
For part (b): Starting from the overall loss $\mathcal{L}(\theta)$, I applied matrix differentiation rules together with the hint about matrix-vector multiplication to obtain: $$ \nabla_\theta \mathcal{L}(\theta)=X^{\top}(X \theta-Y) $$
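As a numerical sanity check (not a proof), I also compared both formulas against central finite differences on random data; a minimal NumPy sketch (all names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 3
X = rng.standard_normal((N, p))
Y = rng.standard_normal(N)
theta = rng.standard_normal(p)

# (a) per-sample gradient: (X_i^T theta - Y_i) X_i
i = 2
grad_a = (X[i] @ theta - Y[i]) * X[i]

# (b) full gradient: X^T (X theta - Y)
grad_b = X.T @ (X @ theta - Y)

def fd_grad(f, t, h=1e-6):
    """Central finite-difference gradient of scalar function f at t."""
    g = np.zeros_like(t)
    for j in range(len(t)):
        e = np.zeros_like(t)
        e[j] = h
        g[j] = (f(t + e) - f(t - e)) / (2 * h)
    return g

ell_i = lambda t: 0.5 * (X[i] @ t - Y[i]) ** 2
L = lambda t: 0.5 * np.sum((X @ t - Y) ** 2)

assert np.allclose(grad_a, fd_grad(ell_i, theta), atol=1e-5)
assert np.allclose(grad_b, fd_grad(L, theta), atol=1e-5)
```

Both asserts pass on random instances, which makes me fairly confident the formulas themselves are right; it's the rigorous derivation I'd like help with.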
My Questions:
- Rigorous Derivation: Could someone provide a step-by-step, rigorous derivation of both (a) and (b), possibly highlighting any subtleties in matrix calculus that I might have glossed over?
- Matrix Calculus Techniques: Are there specific matrix calculus techniques or identities that are particularly useful in simplifying these derivations?
- Generalization: How might these results generalize to other loss functions or optimization problems within machine learning?
I believe that understanding these derivations in depth will greatly enhance my comprehension of optimization in statistical learning. Thank you in advance for your time and help!
$ \def\R#1{{\mathbb R}^{#1}} \def\L{{\cal L}} \def\t{\theta} \def\k{\otimes} \def\h{\tfrac12\:} \def\o{{\tt1}} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\gradLR#1#2{\LR{\grad{#1}{#2}}} \def\c#1{\color{red}{#1}} $Here are a few things that I find very helpful for Matrix Calculus problems.
The Frobenius Product is denoted by a colon and will help you avoid silly transposition errors.
It has the following properties $$\eqalign{ A,B&\in\R{m\times n}\qquad X\in\R{m\times p}\qquad Y\in\R{p\times n} \\ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ B:B &= \frob{B}^2 \qquad \{ {\rm Frobenius\;norm} \}\\ A:B &= B:A \;=\; B^T:A^T \\ \LR{XY}:B &= X:\LR{BY^T} \;=\; Y:\LR{X^TB} \\ A:B &= I_m\!:\LR{BA^T} \;=\; I_n:\LR{A^TB} \\ }$$ Next is the differential of $\L$, which is denoted as ${d\L}$ and defined in terms of the gradient $\gradLR{\L}{X}$ and the differential $dX$ (which obeys the same rules of Matrix Algebra as $X$ itself) $$\eqalign{ d\L &= \gradLR{\L}{X}:dX \\ }$$ Using differentials allows you to avoid the awkward matrix-by-matrix and matrix-by-vector gradients which are often required by the Chain Rule.
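If it helps to see these identities concretely, here is a small NumPy check of each one on random matrices (my own sketch, not part of the derivation; `frob` is a hypothetical helper for $A:B$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 4, 3, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
X = rng.standard_normal((m, p))
Y = rng.standard_normal((p, n))

# A:B = sum_ij A_ij B_ij
frob = lambda P, Q: np.sum(P * Q)

assert np.isclose(frob(A, B), np.trace(A.T @ B))             # A:B = Tr(A^T B)
assert np.isclose(frob(B, B), np.linalg.norm(B, 'fro') ** 2)  # B:B = ||B||_F^2
assert np.isclose(frob(A, B), frob(B, A))                     # A:B = B:A
assert np.isclose(frob(A, B), frob(B.T, A.T))                 # = B^T:A^T
assert np.isclose(frob(X @ Y, B), frob(X, B @ Y.T))           # (XY):B = X:(BY^T)
assert np.isclose(frob(X @ Y, B), frob(Y, X.T @ B))           # (XY):B = Y:(X^T B)
assert np.isclose(frob(A, B), frob(np.eye(m), B @ A.T))       # A:B = I_m:(BA^T)
assert np.isclose(frob(A, B), frob(np.eye(n), A.T @ B))       # A:B = I_n:(A^T B)
```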
Next are the standard basis vectors $\{e_i\}$ and matrices $\{E_{jk}\}$ whose components are all $0$ except for the one component indicated by the subscripts, which equals $\o$. Interestingly they are also the component-wise$\;$self-gradients of an arbitrary vector or matrix variable $$\eqalign{ \grad x{x_i} = e_i \qquad\qquad \grad X{X_{jk}} = E_{jk} }$$ In your problem, note that $$\eqalign{ e_k^T\LR{X\t-Y} = \LR{X_k^T\t-Y_k} \\ }$$ Let's examine your cost functions using the above notation $$\eqalign{ \ell_k &= \h e_k^T\LR{X\t-Y}:e_k^T\LR{X\t-Y} \\ &= \h e_ke_k^T:\LR{X\t-Y}\LR{X\t-Y}^T \\ &= \h E_{kk}:\LR{X\t-Y}\LR{X\t-Y}^T \\ \\ \L &= \h \LR{X\t-Y}:\LR{X\t-Y} \\ &= \h I :\LR{X\t-Y}\LR{X\t-Y}^T \\ &= \LR{\sum_{k=\o}^N \h E_{kk}}:\LR{X\t-Y}\LR{X\t-Y}^T \;\equiv\; \sum_{k=\o}^N \ell_k \\ }$$ As you can see, the functions are closely related.
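The decomposition above, i.e. $\ell_k = \h E_{kk}:MM^T$ with $M = X\t-Y$ and $\L = \sum_k \ell_k$, can be confirmed numerically; a short NumPy sketch on random data (variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 5, 3
X = rng.standard_normal((N, p))
Y = rng.standard_normal(N)
theta = rng.standard_normal(p)

M = X @ theta - Y          # residual vector M = X*theta - Y
MMt = np.outer(M, M)       # the matrix M M^T

# per-sample loss two ways: definition vs. (1/2) E_kk : M M^T
for k in range(N):
    ell_def = 0.5 * (X[k] @ theta - Y[k]) ** 2
    E_kk = np.zeros((N, N))
    E_kk[k, k] = 1.0
    ell_frob = 0.5 * np.sum(E_kk * MMt)
    assert np.isclose(ell_def, ell_frob)

# total loss: (1/2) I : M M^T equals both (1/2)||M||^2 and sum_k ell_k
L_total = 0.5 * np.sum(np.eye(N) * MMt)
assert np.isclose(L_total, 0.5 * np.sum(M ** 2))
assert np.isclose(L_total, sum(0.5 * (X[k] @ theta - Y[k]) ** 2 for k in range(N)))
```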
I'll do a step-by-step calculation for one of them and leave the other to you.
For typing convenience, define the matrix variable $$\eqalign{ M &= \LR{X\t-Y} \qiq \c{dM = X\,d\t} \\ }$$ Then $$\eqalign{ \L &= \h M:M \\ d\L &= \h dM:M \;+\; \h M:dM \\ &= M:\c{dM} \\ &= M:\c{X\,d\t} \\ &= X^TM:d\t \\ &= X^T\LR{X\t-Y}:d\t \\ \grad{\L}{\t} &= X^T\LR{X\t-Y} \\ }$$
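Since $d\L = X^T\LR{X\t-Y}:d\t$ is exactly a statement about directional derivatives, you can also check the final line numerically: perturb $\t$ along a random direction $d$ and compare the finite-difference rate of change of $\L$ with $\grad{\L}{\t}:d$. A hedged NumPy sketch (random data, names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 6, 4
X = rng.standard_normal((N, p))
Y = rng.standard_normal(N)
theta = rng.standard_normal(p)

L = lambda t: 0.5 * np.sum((X @ t - Y) ** 2)
grad = X.T @ (X @ theta - Y)   # the derived gradient X^T(X*theta - Y)

# directional derivative along a random direction d:
# (L(theta + h d) - L(theta - h d)) / (2h) should match grad : d = grad @ d
d = rng.standard_normal(p)
h = 1e-6
dd_numeric = (L(theta + h * d) - L(theta - h * d)) / (2 * h)
assert np.isclose(dd_numeric, grad @ d, atol=1e-5)
```

This is the same logic as the derivation: the differential $d\L$ evaluated at $d\t = h\,d$ is, to first order, the change in $\L$.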