How can we take derivatives on $UV$ w.r.t a vector $\mathbf{x}$ by using the chain rule and $d(UV)=dU\cdot V + U\cdot dV$?


There is an example about this question on page 564, in Chapter 11 of Boyd & Vandenberghe's Convex Optimization. It presents the gradient and Hessian of the following log barrier function:

$$\tag{11.5} \phi(x)=-\sum_{i=1}^m \log(-f_i(x)) $$ $$ \nabla \phi(x)=\sum_{i=1}^m\frac{1}{-f_i(x)} \nabla f_i(x)\\ \nabla^2 \phi(x)=\sum_{i=1}^m\frac{1}{f_i(x)^2} \nabla f_i(x)\nabla f_i(x)^T+\sum_{i=1}^m\frac{1}{-f_i(x)} \nabla^2 f_i(x) $$
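As a quick numeric sanity check of (11.5), here is a sketch assuming hypothetical linear constraints $f_i(x) = a_i^T x - b_i$ (so $\nabla f_i = a_i$ and $\nabla^2 f_i = 0$, and the second sum in the Hessian vanishes), comparing both formulas against central finite differences:

```python
import numpy as np

# Hypothetical linear constraints f_i(x) = a_i^T x - b_i, so that
# grad phi = sum_i a_i / (-f_i)  and  grad^2 phi = A^T Diag(1/f_i^2) A.
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.normal(size=(m, n))            # rows are a_i^T
x = rng.normal(size=n)
b = A @ x + rng.uniform(1.0, 2.0, m)   # chosen so f(x) = A x - b < 0

def phi(x):
    return -np.sum(np.log(-(A @ x - b)))

f = A @ x - b                          # strictly negative at this x
grad = A.T @ (1.0 / -f)                # first formula in (11.5)
hess = A.T @ np.diag(1.0 / f**2) @ A   # second formula (second sum vanishes)

# Central finite differences for the gradient ...
I = np.eye(n)
eps = 1e-6
grad_fd = np.array([(phi(x + eps*e) - phi(x - eps*e)) / (2*eps) for e in I])
assert np.allclose(grad, grad_fd, atol=1e-5)

# ... and cross central differences for the Hessian (larger step for roundoff).
eps = 1e-4
hess_fd = np.array([[(phi(x + eps*(I[i] + I[j])) - phi(x + eps*(I[i] - I[j]))
                      - phi(x - eps*(I[i] - I[j])) + phi(x - eps*(I[i] + I[j])))
                     / (4 * eps**2) for j in range(n)] for i in range(n)])
assert np.allclose(hess, hess_fd, atol=1e-4)
```

The toy data ($A$, $b$, the random seed) is arbitrary; any strictly feasible point works.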

I could not figure out the first term of the Hessian. Applying the chain rule and the product rule with $U=\frac{1}{-f_i(x)}$ and $V=\nabla f_i(x)$, I get $dU=\frac{1}{f_i(x)^2}\nabla f_i(x)$, so the first term of the Hessian should be $dU\cdot V=\frac{1}{f_i(x)^2}\nabla f_i(x)\cdot \nabla f_i(x)$. But this product is dimensionally inconsistent, since $\nabla f_i(x)$ is a column vector. I can see that the quoted form $\nabla f_i(x)\nabla f_i(x)^T$ has the right dimensions and is consistent with the second term, but I do not see how to arrive at it. Can anybody give me some instruction on calculating derivatives using the chain rule and the product rule simultaneously in cases like this? I would appreciate any pointers.


There are 3 answers below.


This is a very informal answer; it can be formalised, but that becomes even more tedious, and I think it is instructive to see where the terms come from using linear approximations.

The main issue is how we view derivatives, in particular the second derivative. Take a function $\psi:X \to Y$; then $\psi'(x) \in L(X,Y)$, that is, a linear map from $X$ to $Y$. We write $\psi'(x)(h)$ to indicate $\psi'(x)$ applied to $h$.

In a similar manner, $\psi''(x) \in L(X,L(X,Y))$, that is a linear map from $X$ into the space of linear maps $L(X,Y)$. We could write $\psi''(x)(h)$ to indicate a map $L(X,Y)$ and $\psi''(x)(h)(w)$ to indicate that map evaluated at $w$. Since we can identify the space $L(X,L(X,Y))$ with the space of bilinear forms $X \times X \to Y$, it is more usual to write $\psi''(x)(h,w)$, and the matrix representation of $(h,w) \mapsto \psi''(x)(h,w)$ is the Hessian.

In particular, to first order, we have $\psi'(x+h)(w) \approx \psi'(x)(w)+ \psi''(x)(h,w)$.

Now consider a single term of the form $\psi(x) = g(f(x))$. The chain rule gives $\psi'(x)(h) = g'(f(x))(f'(x)(h))$.

Now consider $\psi'(x+h^*)$ applied to $h$. To first order we have \begin{eqnarray} \psi'(x+h^*)(h) &=& g'(f(x+h^*))(f'(x+h^*)(h)) \\ &\approx& g'(f(x)+f'(x)(h^*))(f'(x+h^*)(h)) \\ &\approx& g'(f(x))(f'(x+h^*)(h))+g''(f(x))(f'(x)h^*,f'(x+h^*)(h)) \\ &\approx& g'(f(x))(f'(x)(h)+f''(x)(h^*,h))+g''(f(x))(f'(x)h^*,f'(x)(h)+f''(x)(h^*,h))\\ &=& g'(f(x))f'(x)(h) + g'(f(x))(f''(x)(h^*,h)) + g''(f(x))(f'(x)h^*,f'(x)(h)) + g''(f(x))(f'(x)h^*,f''(x)(h^*,h)) \end{eqnarray} If we retain the first order terms in $h^*$ we get $\psi''(x)(h^*,h) = g'(f(x))(f''(x)(h^*,h)) + g''(f(x))(f'(x)h^*,f'(x)(h))$.
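For scalar-valued $f$ and $g$ this identity reads $\nabla^2 (g\circ f) = g'(f)\,\nabla^2 f + g''(f)\,\nabla f \nabla f^T$, which can be checked numerically. The sketch below uses the hypothetical smooth choices $f(x) = \tfrac12 x^T Q x$ and $g = \sin$:

```python
import numpy as np

# Hypothetical smooth choices: f(x) = x^T Q x / 2 (so grad f = Q x and
# grad^2 f = Q for symmetric Q) and g = sin (so g' = cos, g'' = -sin).
rng = np.random.default_rng(1)
n = 4
Q = rng.normal(size=(n, n))
Q = Q + Q.T                       # symmetrise
x = 0.5 * rng.normal(size=n)      # modest scale keeps finite differences accurate

psi = lambda x: np.sin(0.5 * x @ Q @ x)

f = 0.5 * x @ Q @ x
gf, Hf = Q @ x, Q
H_analytic = np.cos(f) * Hf - np.sin(f) * np.outer(gf, gf)

# Cross central differences approximate e_i^T (grad^2 psi) e_j.
I = np.eye(n)
eps = 1e-4
H_fd = np.array([[(psi(x + eps*(I[i] + I[j])) - psi(x + eps*(I[i] - I[j]))
                   - psi(x - eps*(I[i] - I[j])) + psi(x - eps*(I[i] + I[j])))
                  / (4 * eps**2) for j in range(n)] for i in range(n)])
assert np.allclose(H_analytic, H_fd, atol=1e-4)
```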

Now let $g(t) = - \log (-t)$; since this is scalar valued, we write $g'(t) = - {1 \over t}$, $g''(t) = {1 \over t^2}$.

If we use the notation $f'(x)(h) = \nabla f(x)^T h$ and $f''(x)(h^*,h) = (h^*)^T \nabla^2 f(x) h$ we get \begin{eqnarray} \psi''(x)(h^*,h) &=& - {1 \over f(x)} (h^*)^T \nabla^2 f(x) h + {1 \over f(x)^2} \nabla f(x)^T h^* \nabla f(x)^T h \\ &=& (h^*)^T \left[ - {1 \over f(x)} \nabla^2 f(x) + {1 \over f(x)^2} \nabla f(x) \nabla f(x)^T \right] h \end{eqnarray} and so $\nabla^2 \psi(x) = - {1 \over f(x)} \nabla^2 f(x) + {1 \over f(x)^2} \nabla f(x) \nabla f(x)^T$.
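This can also be tested directly as a bilinear form, evaluating $\psi''(x)(h^*,h)$ along two fixed directions via a cross finite difference. The sketch below assumes the hypothetical constraint $f(x) = \|x\|^2 - 4$:

```python
import numpy as np

# Hypothetical constraint f(x) = ||x||^2 - 4, so psi(x) = -log(4 - ||x||^2)
# is defined near the origin; grad f = 2x and grad^2 f = 2I.
x  = np.array([0.3, -0.5, 0.2])
hs = np.array([1.0, 0.0, -1.0])    # the direction h*
h  = np.array([0.5, 2.0, 1.0])     # the direction h

f   = lambda x: x @ x - 4.0
psi = lambda x: -np.log(-f(x))
gf  = 2 * x                        # grad f(x)
Hf  = 2 * np.eye(3)                # grad^2 f(x)

# psi''(x)(h*, h) = -(1/f) h*^T (grad^2 f) h + (1/f^2) (grad f . h*)(grad f . h)
bilinear = (-(1/f(x)) * (hs @ Hf @ h)
            + (1/f(x)**2) * (gf @ hs) * (gf @ h))

# Cross finite difference along the same two directions.
eps = 1e-4
fd = (psi(x + eps*hs + eps*h) - psi(x + eps*hs - eps*h)
      - psi(x - eps*hs + eps*h) + psi(x - eps*hs - eps*h)) / (4 * eps**2)
assert abs(bilinear - fd) < 1e-5
```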


$ \def\sd{\cdot}\def\dd{:} \def\o{{\large\tt1}}\def\p{\partial} \def\H{{\cal H}} \def\B{\Big}\def\L{\left}\def\R{\right} \def\LR#1{\L(#1\R)} \def\BR#1{\B(#1\B)} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad&\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} $For typing convenience, define the Jacobian and Hessian of $f(x)$ as $$\eqalign{ J &= \grad{f}{x} &\qiq J_{ij} = \grad{f_i}{x_j} &\qiq df = J\sd dx \\ \H &= \grad{J}{x} &\qiq \H_{ijk} = \grad{J_{ij}}{x_k} &\qiq dJ = \H\sd dx \\ }$$ and the following vector variables $$\eqalign{ a &= \o \oslash f \qiq A = \Diag{a},\;\; a=A\sd\o \\ da &= -a\odot a\odot df \;&\;=\quad -A^2\sd J\sd dx \\ b &= \log(-f) \\ db &= a\odot df &\;=\qquad A\sd J\sd dx \\ }$$ where log() is applied elementwise, $\;\{\odot,\oslash\}$ denote elementwise multiplication and division, and $\;\{\sd\}$ denotes the dot-product. The all-ones vector is $\o$.

Now write the barrier function and calculate its gradient using the above notation. $$\eqalign{ \phi &= -\o\sd b \\ d\phi &= -\o\sd db \\ &= -\o\sd \LR{A\sd J\sd dx} \\ &= -\LR{J^T\sd A\sd \o}\sd dx \\ &= -\LR{J^T\sd a}\sd dx \\ \grad{\phi}{x} &= -\;{J^T\sd a} \;\;\doteq\;\; g \\ }$$ So that's the gradient, now calculate the gradient of the gradient (i.e. the Hessian). $$\eqalign{ dg &= -\;{J^T\sd da} \;-\; dJ^T\sd a \\ &= \LR{J^T\sd A^2\sd J\sd dx} \;-\; \LR{\H\sd dx}^T\sd a \\ &= \BR{J^T\sd A^2\sd J \;-\; a\sd\H}\sd dx \\ \grad{g}{x} &= {J^T\sd A^2\sd J \;-\; a\sd\H} \;\;\doteq\;\; \grad{^2\phi}{x\,\partial x^T} \\ }$$ Writing this in index notation might make things clearer $$\eqalign{ \grad{^2\phi}{x_i\,\partial x_j} &= \LR{\sum_{k=\o}^m\sum_{\ell=\o}^m J_{ki}\,A^2_{k\ell}\,J_{\ell j}} \;-\; \LR{\sum_{k=\o}^m a_k\;\H_{kij}} \\ }$$
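The final stacked result can be checked numerically. The sketch below assumes hypothetical quadratic constraints $f_k(x) = \|x - c_k\|^2 - r_k^2$, for which the Jacobian and the Hessian stack are known in closed form:

```python
import numpy as np

# Hypothetical quadratic constraints f_k(x) = ||x - c_k||^2 - r_k^2, so
# J[k] = grad f_k^T = 2(x - c_k)^T and grad^2 f_k = 2I for every k.
rng = np.random.default_rng(2)
m, n = 4, 3
C = rng.normal(size=(m, n))               # centers c_k
x = 0.1 * rng.normal(size=n)
r2 = np.sum((C - x)**2, axis=1) + 1.0     # radii^2 chosen so f_k(x) = -1 < 0

fvec = lambda x: np.sum((x - C)**2, axis=1) - r2
phi = lambda x: -np.sum(np.log(-fvec(x)))

f = fvec(x)
J = 2 * (x - C)                                     # Jacobian, shape (m, n)
Hstack = np.broadcast_to(2 * np.eye(n), (m, n, n))  # Hessian stack H_kij
a = 1.0 / f                                         # a = 1 ./ f, elementwise

# grad^2 phi = J^T A^2 J - sum_k a_k H_k
H_phi = J.T @ np.diag(a**2) @ J - np.einsum('k,kij->ij', a, Hstack)

I = np.eye(n)
eps = 1e-4
H_fd = np.array([[(phi(x + eps*(I[i] + I[j])) - phi(x + eps*(I[i] - I[j]))
                   - phi(x - eps*(I[i] - I[j])) + phi(x - eps*(I[i] + I[j])))
                  / (4 * eps**2) for j in range(n)] for i in range(n)])
assert np.allclose(H_phi, H_fd, atol=1e-4)
```

The `einsum` contraction is exactly the index expression above: it sums $a_k\,\mathcal{H}_{kij}$ over $k$.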


Because the barrier is a sum of terms of the same form, we can reason on a single component and sum everything at the end. The differential of one term is readily written $$ d\phi = \frac{-df_n}{f_n} $$

Introducing the gradient column vector $\nabla f_n(\mathbf{x})$, we can write $ df_n = (\nabla f_n)^T d\mathbf{x} $.

The gradient is thus $$ \mathbf{g} = \frac{\partial \phi}{\partial \mathbf{x}} = \frac{-1}{f_n} \nabla f_n (\mathbf{x}) $$

Introducing the Hessian matrix $d\nabla f_n = \mathbf{H}_{f_n} d\mathbf{x}$, it follows that \begin{eqnarray*} d\mathbf{g} &=& \frac{df_n}{f_n^2} \nabla f_n (\mathbf{x}) -\frac{1}{f_n} (d\nabla f_n) \\ &=& \frac{1}{f_n^2} \nabla f_n (\mathbf{x}) (\nabla f_n)^T d\mathbf{x} -\frac{1}{f_n} \mathbf{H}_{f_n} d\mathbf{x} \end{eqnarray*} The Hessian is thus $$ \mathbf{H}_\phi= \frac{1}{f_n^2} \nabla f_n (\mathbf{x}) (\nabla f_n)^T -\frac{1}{f_n} \mathbf{H}_{f_n} $$
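A quick numeric check of this per-component result, assuming the hypothetical single constraint $f_n(x) = e^{x_0} + x_1^2 - 5$ (0-based coordinates), negative near the origin:

```python
import numpy as np

# Hypothetical single constraint f_n(x) = exp(x_0) + x_1^2 - 5, so
# grad f_n = (exp(x_0), 2 x_1) and H_fn = diag(exp(x_0), 2).
x = np.array([0.2, -0.7])
fn = lambda x: np.exp(x[0]) + x[1]**2 - 5.0
phi = lambda x: -np.log(-fn(x))

g_fn = np.array([np.exp(x[0]), 2 * x[1]])
H_fn = np.diag([np.exp(x[0]), 2.0])

g = -(1.0 / fn(x)) * g_fn                # gradient formula from the answer
H_phi = ((1.0 / fn(x)**2) * np.outer(g_fn, g_fn)
         - (1.0 / fn(x)) * H_fn)         # Hessian formula from the answer

# Central finite differences for the gradient ...
I = np.eye(2)
eps = 1e-6
g_fd = np.array([(phi(x + eps*e) - phi(x - eps*e)) / (2 * eps) for e in I])
assert np.allclose(g, g_fd, atol=1e-5)

# ... and cross central differences for the Hessian.
eps = 1e-4
H_fd = np.array([[(phi(x + eps*(I[i] + I[j])) - phi(x + eps*(I[i] - I[j]))
                   - phi(x - eps*(I[i] - I[j])) + phi(x - eps*(I[i] + I[j])))
                  / (4 * eps**2) for j in range(2)] for i in range(2)])
assert np.allclose(H_phi, H_fd, atol=1e-4)
```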