Understanding numerator/denominator layout in matrix-calculus

This is a distilled version of this question.

Consider the following: $$ \begin{align} z & = f(\mathbf{y}) \\ \mathbf{y} & = g(\mathbf{x}) \\ \text{where, } & z \in \mathbb{R} \text{, and} \\ & \mathbf{y}, \mathbf{x} \text{ are two $(1, m)$ dimensional vectors, i.e. row-vectors} \end{align} $$

Using numerator-layout, what is the dimension of the derivative $\frac{\mathrm{d}z}{\mathrm{d}\mathbf{y}}$?

  • Should it be a column-vector of dimension $(m, 1)$, because $\mathbf{y}$ is a row-vector of dimension $(1, m)$? (Source)
    • But this convention causes problems when computing the chain rule $\frac{\mathrm{d} z}{\mathrm{d} \mathbf{x}} = \frac{\mathrm{d} z}{\mathrm{d} \mathbf{y}} \frac{\mathrm{d} \mathbf{y}}{\mathrm{d} \mathbf{x}}$: $\frac{\mathrm{d} \mathbf{y}}{\mathrm{d} \mathbf{x}}$ would be an $(m, m)$ matrix while $\frac{\mathrm{d} z}{\mathrm{d} \mathbf{y}}$ is an $(m, 1)$ vector, so the product is not defined.
    • However, this notation does serve well when computing the derivatives of the form $\frac{\mathrm{d} h(\mathbf{X})}{\mathrm{d}\mathbf{X}}$, where $\mathbf{X}$ is a matrix of dimension $(m, n)$; and $h(\mathbf{X})$ is a scalar-valued function.
  • Or should it be a row-vector, because according to the numerator-layout the derivative has dimensions $\text{numerator-dimension} \times (\text{denominator-dimension})^\intercal = (1,1)\times(m, 1)$? (Source)
    • Also, (for this point) is my understanding even correct?

PS: Also, is there a definitive guide from which I can learn matrix calculus from first principles? Although the following sources are good, they still leave a lot of gaps:

There is 1 best solution below


$ \def\qiq{\quad\implies\quad} \def\trace#1{\operatorname{Tr}(#1)} \def\shape#1{\operatorname{shape}(#1)} \def\LR#1{\left(#1\right)} \def\p{\partial}\def\o{\tt1}\def\z{\zeta} \def\grad#1#2{\frac{\partial #1}{\partial #2}} $Let's use a convention where uppercase Latin denotes a matrix, lowercase Latin a vector, and Greek letters are scalars. Always write the column vector as $y$ and the corresponding row vector as $y^T$. Finally, let's use a colon to denote the matrix inner product, i.e. $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij}\;=\;\trace{A^TB} \\ A:A &= \big\|A\big\|^2_F \\ }$$ When applied to column $\big(n=\o\big)$ or row $\big(m=\o\big)$ vectors this corresponds to the usual dot product.
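As a quick sanity check of the colon product, here is a minimal NumPy sketch. The helper name `frob`, the random test matrices, and the seed are illustrative assumptions, not part of the answer; the sketch only confirms numerically that the elementwise-sum and trace forms agree and that the product reduces to the dot product for vectors.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))


def frob(P, Q):
    """Matrix inner product P:Q, i.e. the sum of elementwise products."""
    return float(np.sum(P * Q))


# Two equivalent ways to evaluate A:B
inner_sum = frob(A, B)
inner_trace = float(np.trace(A.T @ B))

# A:A equals the squared Frobenius norm
fro_sq = float(np.linalg.norm(A, "fro") ** 2)

# For column vectors (n = 1) or row vectors (m = 1) the colon product
# is the usual dot product
u = rng.standard_normal((3, 1))
v = rng.standard_normal((3, 1))
dot_uv = float(u.ravel() @ v.ravel())

print(np.isclose(inner_sum, inner_trace))
print(np.isclose(frob(A, A), fro_sq))
print(np.isclose(frob(u, v), dot_uv), np.isclose(frob(u.T, v.T), dot_uv))
```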

Now write the functions using column vectors, and calculate their Jacobians, gradients and differentials as
$$\eqalign{ y &= y(x) \qiq J=\grad{y}{x} &\qiq dy = J\,dx \\ \z &= \z(y) \qiq g = \grad{\z}{y} &\qiq d\zeta = g:dy \\ }$$ Now calculate the gradient of $\z$ with respect to $x$ by back substitution $$\eqalign{ d\z &= g:dy = g:J\,dx = J^Tg:dx \\ \grad{\z}{x} &= J^Tg = \LR{\grad{y}{x}}^T\LR{\grad{\z}{y}} \\ }$$ Next, consider the case of a matrix argument $Y$ $$\eqalign{ \z &= \z(Y) \qiq G = \grad{\z}{Y} &\qiq d\z = G:dY \\ }$$ In general, the shape of the gradient of a scalar-valued function should match the shape of the independent variable, e.g. $\shape{g}=\shape{y}\;$ and $\;\shape{G}=\shape{Y}$.

The shape of the Jacobian of a vector-valued function is such that it can be multiplied with the differential of the independent vector, i.e. such that $\,J\,dx\;$ is dimensionally compatible.
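The column-vector result $J^Tg$ and the shape rules can be checked numerically. The sketch below assumes a concrete pair of functions chosen purely for illustration, $y = W\sin(x)$ and $\zeta = \tfrac12\,y:y$ (neither appears in the question), and compares $J^Tg$ against central finite differences.

```python
import numpy as np

m = 4
rng = np.random.default_rng(1)      # arbitrary seed
W = rng.standard_normal((m, m))     # hypothetical mixing matrix


def y_of_x(x):
    """Illustrative vector function y = W sin(x); x and y have shape (m, 1)."""
    return W @ np.sin(x)


def zeta_of_y(y):
    """Illustrative scalar function zeta = 0.5 * (y : y)."""
    return 0.5 * float(np.sum(y * y))


x = rng.standard_normal((m, 1))
y = y_of_x(x)

# Jacobian with J[i, j] = d y_i / d x_j = W[i, j] * cos(x_j), so dy = J dx
J = W @ np.diag(np.cos(x).ravel())
# Gradient of zeta with respect to y; same shape as y
g = y
# Chain rule in column-vector form
grad_x = J.T @ g

# Central finite differences as an independent check
eps = 1e-6
fd = np.zeros_like(x)
for j in range(m):
    e = np.zeros_like(x)
    e[j, 0] = eps
    fd[j, 0] = (zeta_of_y(y_of_x(x + e)) - zeta_of_y(y_of_x(x - e))) / (2 * eps)

print(g.shape == y.shape, grad_x.shape == x.shape)
print(np.allclose(grad_x, fd, atol=1e-5))
```

Note that `grad_x` has the shape of `x`, and `g` has the shape of `y`, in line with the rule that the gradient of a scalar-valued function matches the shape of its argument.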


To make the formulas work with row vectors, simply transpose everything. $$\eqalign{ y^T &= y^T(x^T) &\qiq dy^T = dx^TJ^T \\ \z &= \z(y^T) &\qiq g^T = \grad{\z}{y^T} \\ \\ d\z &= dy^T:g^T \\&= dx^TJ^T:g^T \\&= dx^T:g^TJ \\ \grad{\z}{x^T} &= g^TJ \\ \\ }$$
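The row-vector form can be checked the same way. This sketch reuses the illustrative choice $y = W\sin(x)$, $\zeta = \tfrac12\,y:y$ (an assumption, not taken from the question) and confirms that $g^TJ$ is a $(1, m)$ row equal to the transpose of the column result $J^Tg$. It also shows that multiplying an $(m, 1)$ gradient by the $(m, m)$ Jacobian, the combination the question ran into, is rejected as a shape mismatch.

```python
import numpy as np

m = 3
rng = np.random.default_rng(2)      # arbitrary seed
W = rng.standard_normal((m, m))     # hypothetical mixing matrix

xT = rng.standard_normal((1, m))    # row vector x^T
yT = np.sin(xT) @ W.T               # y^T = sin(x^T) W^T, shape (1, m)

# Same Jacobian as in the column convention: J[i, j] = d y_i / d x_j
J = W @ np.diag(np.cos(xT).ravel())
# Row gradient g^T = d zeta / d y^T for zeta = 0.5 * (y : y)
gT = yT

grad_xT = gT @ J                    # row form of the chain rule, shape (1, m)
grad_x_col = J.T @ gT.T             # column form, shape (m, 1)

# An (m, 1) gradient times an (m, m) Jacobian does not multiply
try:
    gT.T @ J
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(grad_xT.shape, np.allclose(grad_xT, grad_x_col.T), mismatch_raised)
```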