Chain rule for differentiation yields conflicting dimensions


Suppose I have functions $f\colon \mathbb{R} \to \mathbb{R}^{n\times n}$ and $g\colon \mathbb{R}^{n} \to \mathbb{R}$, differentiable in the sense of the Fréchet derivative, where $f$ is a linear operator, and I want to compute the (Fréchet) derivative of their composition $f \circ g\colon \mathbb{R}^{n} \to \mathbb{R}^{n\times n}$. Using the chain rule for normed spaces I obtain \begin{align*} D(f \circ g)(\mathbf{x})h = \underbrace{(\mathrm{D}f)(g(\mathbf{x}))}_{\in \mathbb{R}^{n\times n}} \cdot \underbrace{\mathrm{D}g(\mathbf{x})}_{\in \mathbb{R}^{1\times n}} h, \quad h \in \mathbb{R}^{n}. \end{align*} How can this hold when the dimensions of the product do not match up?

Edit: Consider the function \begin{align*} f(\mathbf{x}) = \mathbf{A}(\mathbf{x})\mathbf{x} \end{align*} with \begin{align*} \mathbf{A}\colon \mathbb{R}^{n} \to \mathbb{R}^{n\times n}, \quad \mathbf{A}(\mathbf{x}) = \begin{bmatrix} \alpha(\mathbf{x}) & 1 & 0 & \cdots & 0 \\ 1 & \alpha(\mathbf{x}) & 1 & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & 1 & \alpha(\mathbf{x}) & 1 \\ 0 & \cdots & 0 & 1 & \alpha(\mathbf{x}) \end{bmatrix}, \end{align*} where $\alpha(\mathbf{x}) = \left\|\mathbf{x}\right\|_{2}$. My professor rewrites this as \begin{align*} \mathbf{A}(\mathbf{x})\mathbf{x} = \mathbf{T}\mathbf{x} + \mathbf{x}\left\|\mathbf{x}\right\|_{2}, \quad \mathbf{T}:=\left[\begin{array}{ccccc} 0 & 1 & & & \\ 1 & 0 & \ddots & & \\ & \ddots & \ddots & \ddots & \\ & & \ddots & 0 & 1 \\ & & & 1 & 0 \end{array}\right], \end{align*} so that $\mathbf{A}(\mathbf{x}) = \mathbf{T} + \left\|\mathbf{x}\right\|_{2}\mathbf{I}$. He then finds \begin{align*} \mathrm{D}f(\mathbf{x})\mathbf{h} = \mathbf{T}\mathbf{h} + \|\mathbf{x}\|_{2}\,\mathbf{h} + \mathbf{x}\,\frac{\mathbf{x}^{\top}\mathbf{h}}{\|\mathbf{x}\|_{2}} = \left(\mathbf{A}(\mathbf{x}) + \frac{\mathbf{x}\mathbf{x}^{\top}}{\|\mathbf{x}\|_{2}}\right)\mathbf{h}, \end{align*}

from which I concluded that \begin{align*} \mathrm{D}(\mathbf{A}(\mathbf{x}))= \frac{\mathbf{x}\mathbf{x}^{\mathsf{T}}}{\left\|\mathbf{x}\right\|_{2} } .\end{align*}
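The professor's closed form for $\mathrm{D}f(\mathbf{x})$ can be checked numerically with a central finite difference. The sketch below assumes NumPy; the dimension $n$, the random seed, and the helper names (`make_A`, `f`) are my own choices, not from the problem.

```python
import numpy as np

def make_A(x):
    """Tridiagonal A(x): ones off the diagonal, alpha(x) = ||x||_2 on it."""
    n = len(x)
    return (np.diag(np.ones(n - 1), -1)
            + np.diag(np.ones(n - 1), 1)
            + np.linalg.norm(x) * np.eye(n))

def f(x):
    return make_A(x) @ x

n = 5
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
h = rng.standard_normal(n)

# Professor's closed form: Df(x)h = (A(x) + x x^T / ||x||_2) h
Df_h = (make_A(x) + np.outer(x, x) / np.linalg.norm(x)) @ h

# Central finite difference of f along the direction h
eps = 1e-6
fd = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)

print(np.max(np.abs(Df_h - fd)))  # small (finite-difference error)
```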


There are 3 answers below.

Answer 1

For the purposes of this type of multivariate calculus (e.g. Frechet derivatives), a domain or codomain of $m\times n$ real matrices is identified with $\Bbb R^{mn}$, not with $\Bbb R^{m\times n}$. You "flatten" your matrices before you do derivatives and chain rules on them. At least if you want your derivative at a given point to be represented by a standard rectangular grid of numbers.

If you don't want to flatten your matrices before you do calculus in them, then your derivatives will be cuboids of higher dimension. You're now venturing into what I would call tensor calculus territory.
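The flattening can be made concrete: for the tridiagonal $\mathbf A(\mathbf x)$ from the question, the numeric Jacobian of the flattened map $\mathbf x \mapsto \operatorname{vec}(\mathbf A(\mathbf x))$ is an $(n^2, n)$ rectangular grid. A sketch assuming NumPy (dimension and seed are arbitrary):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
x = rng.standard_normal(n)

def A(x):
    """A(x) = T + ||x||_2 I with T tridiagonal (zeros on the diagonal)."""
    T = np.diag(np.ones(n - 1), -1) + np.diag(np.ones(n - 1), 1)
    return T + np.linalg.norm(x) * np.eye(n)

def vecA(x):
    return A(x).ravel()  # flatten R^{n x n} -> R^{n^2}

# Numeric Jacobian of the flattened map: one column per basis direction
eps = 1e-6
J = np.column_stack([
    (vecA(x + eps * e) - vecA(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

# Analytically dA = I d(alpha) with d(alpha) = x^T dx / ||x||_2,
# so the flattened Jacobian is vec(I) x^T / ||x||_2
J_exact = np.outer(np.eye(n).ravel(), x) / np.linalg.norm(x)

print(J.shape)  # an (n^2, n) grid, here (16, 4)
print(np.max(np.abs(J - J_exact)))  # small (finite-difference error)
```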

Edit: After looking at your example, here is what I think happens: $\mathbf A$ factors as $\Bbb R^n\to \Bbb R\to \Bbb R^{n\times n}$, the way you describe it. But $f$ is a function $\Bbb R^n\to \Bbb R^n$, and as such its Fréchet derivative at a point may be realized as an $n\times n$ matrix. You have been given the matrix $Df = \mathbf{A}(\mathbf{x})+ \dfrac{\mathbf{x}\mathbf{x}^{\mathsf{T}}}{\left\|\mathbf{x}\right\|_{2}}$ as this derivative.

Answer 2

The Fréchet derivative can in fact be defined for maps between arbitrary normed spaces, in particular for maps into the space of $(n \times n)$-matrices. Given $\phi : V \to W$, the Fréchet derivative of $\phi$ at $x \in V$ is a linear map $D\phi(x) = D\phi \mid_x : V \to W$. The chain rule says that $$D(\psi \circ \phi)(x) = D\psi(\phi(x)) \circ D\phi(x). $$ Applying this linear map to $h \in V$ gives $$D(\psi \circ \phi)(x)(h) = D\psi(\phi(x))(D\phi(x)(h)).$$ It seems that you do not properly distinguish between linear maps and their matrix representations, and this may be confusing you.

There is no dimension problem in your question: $Dg(x)$ is a linear map $\mathbb R^n \to \mathbb R$, hence $Dg(x)(h) \in \mathbb R$. Also, $Df(g(x))$ is a linear map $\mathbb R \to \mathbb R^{n \times n}$, and of course you insert $Dg(x)(h)$ as its argument. In terms of matrices: $Dg(x)$ is a $(1 \times n)$-matrix, $Df(g(x))$ is an $(n^2 \times 1)$-matrix (where we identify $\mathbb R^{n \times n}$ with $\mathbb R^{n^2}$ to obtain a "standard" matrix with real entries), and their product $Df(g(x))\,Dg(x)$ is an $(n^2 \times n)$-matrix, representing a linear map $\mathbb R^n \to \mathbb R^{n^2}$. You write $Df(g(x)) \in \mathbb R^{n \times n}$, but this is wrong since it suggests that it is an $(n \times n)$-matrix.
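These shapes can be made concrete with hypothetical instances of the two maps. In this sketch (assuming NumPy) the particular choices $g(\mathbf x)=\|\mathbf x\|_2$ and $f(t)=t\,I$ are mine, not from the question; the point is only the matrix dimensions of the two Jacobians and their product.

```python
import numpy as np

n = 3
rng = np.random.default_rng(2)
x = rng.standard_normal(n)
eps = 1e-6

# Hypothetical instances of the maps in the question:
g = lambda x: np.linalg.norm(x)   # R^n -> R
f = lambda t: t * np.eye(n)       # R -> R^{n x n}, flattened below

# Jacobian of g at x: a (1 x n) row
Dg = np.array([[(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                for e in np.eye(n)]])

# Jacobian of f at g(x), flattened: an (n^2 x 1) column
Df = ((f(g(x) + eps) - f(g(x) - eps)) / (2 * eps)).ravel().reshape(-1, 1)

# Chain-rule product: (n^2 x 1)(1 x n) = (n^2 x n)
print(Dg.shape, Df.shape, (Df @ Dg).shape)
```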

Concerning your example, I do not see the connection to the chain rule: $f$ is not the composition of two functions as required by the chain rule. But your conclusion $$D(\mathbf{A}(\mathbf{x}))= \frac{\mathbf{x}\mathbf{x}^{\mathsf{T}}}{\left\|\mathbf{x}\right\|_{2} }$$ certainly fails. $D \mathbf A(\mathbf x)$ is a linear map $D \mathbf A(\mathbf x) : \mathbb R^n \to \mathbb R^{n \times n}$, but $\frac{\mathbf{x} \mathbf{x}^{\top}}{\|\mathbf{x}\|_{2}}$ is a single $(n \times n)$-matrix.
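A numerical check (a sketch assuming NumPy; dimension and seed arbitrary) makes the failure visible: the directional derivative $D\mathbf A(\mathbf x)\mathbf h$ is a multiple of the identity that depends on $\mathbf h$, not the fixed matrix $\mathbf x\mathbf x^{\top}/\|\mathbf x\|_2$.

```python
import numpy as np

n = 4
rng = np.random.default_rng(3)
x = rng.standard_normal(n)
h = rng.standard_normal(n)
eps = 1e-6

def A(x):
    T = np.diag(np.ones(n - 1), -1) + np.diag(np.ones(n - 1), 1)
    return T + np.linalg.norm(x) * np.eye(n)

# Directional derivative of A at x along h: an n x n matrix for each h
DA_h = (A(x + eps * h) - A(x - eps * h)) / (2 * eps)

# dA = I d(alpha) gives DA(x)h = (x^T h / ||x||_2) I ...
exact = (x @ h / np.linalg.norm(x)) * np.eye(n)
print(np.max(np.abs(DA_h - exact)))  # small

# ... which differs from the fixed matrix x x^T / ||x||_2
wrong = np.outer(x, x) / np.linalg.norm(x)
print(np.max(np.abs(DA_h - wrong)))  # not small
```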

Answer 3

$ \def\bbR#1{{\mathbb R}^{#1}} \def\a{\alpha}\def\o{{\tt1}}\def\p{\partial} \def\E{{\cal E}} \def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} $A matrix-by-vector gradient generates a third-order tensor, so it does not fit comfortably in standard matrix notation.

However, the differential of a matrix has the shape of a matrix and obeys all of the rules of matrix algebra. Likewise, the differential of a vector obeys the familiar rules of vector algebra.

First, consider the differentials of the constituent functions (writing $A = B + I\a$, where $B$ is the constant tridiagonal part of $A$). $$\eqalign{ \a^2 &= x^Tx \\ 2\a\,d\a &= 2x^Tdx &\qiq d\a = \frac{x^Tdx}{\a} \\ A &= B + I\a &\qiq dA = I\,d\a \\ }$$ With these we can differentiate the composite function using back-substitution. $$\eqalign{ f &= Ax \\ df &= A\,dx + dA\,x \\ &= A\,dx + x\,d\a \\ &= \LR{A + \frac{xx^T}{\a}}dx \\ \grad{f}{x} &= \LR{A + \frac{xx^T}{\a}} \\ }$$ If you really need the third-order tensor, it is $$\eqalign{ \grad{A}{x} &= \LR{\frac{I\star x}{\a}} \\ }$$ where $\star$ denotes the dyadic (aka tensor) product.