Assume $x \in R^d$ and $f(x)$ returns a scaler.
I am trying to calculate
$$\nabla_{x} \left( \|\nabla_{x} f(x)\|_1\right)$$
Is there any way to apply the chain rule and calculate this?
Assume $x \in R^d$ and $f(x)$ returns a scaler.
I am trying to calculate
$$\nabla_{x} \left( \|\nabla_{x} f(x)\|_1\right)$$
Is there any way to apply the chain rule and calculate this?
On
$ \newcommand\R{\mathbb R} $
If $g : \R \to \R$, then for $x \in \R^d$ let $g[x] \in \R^d$ be the result of applying $g$ to each component of $x$, i.e. $$ x = \begin{pmatrix}x_1 \\ x_2 \\ \vdots \\ x_d\end{pmatrix} \implies g[x] = \begin{pmatrix}g(x_1) \\ g(x_2) \\ \vdots \\ g(x_d)\end{pmatrix}. $$ Also let $x \odot y$ be the Hadamard product of $x, y \in \R^d$, i.e. $$ \begin{pmatrix}x_1 \\ \vdots \\ x_d\end{pmatrix} \odot \begin{pmatrix}y_2 \\ \vdots \\ y_d\end{pmatrix} = \begin{pmatrix} x_1y_1 \\ \vdots \\ x_dy_d\end{pmatrix}. $$
We make use of two chain rules for gradients:
Note that the $L^1$-norm is equivalent to $$ ||x||_1 = ||g[x]||_2^2 $$ where $g : \R \to \R$ is given by $g(a) = \sqrt{|a|}$ and $||\cdot||_2$ is the $L^2$-norm. Using the chain rules above, we get $$\begin{aligned} \nabla||\nabla f(x)||_1 &= \nabla||g[\nabla f(x)]||_2^2 \\ &= \dot\nabla\bigl[\nabla f(\dot x)\bigr]\cdot\bigl[\nabla_y||g[y]||_2^2\bigr]_{y=\nabla f(x)} \\ &= \dot\nabla\bigl[\nabla f(\dot x)\bigr]\cdot\Bigl[g'[y]\odot\bigl[\nabla_z||z||_2^2\bigr]_{z=g[y]}\Bigr]_{y=\nabla f(x)} \\ &= \dot\nabla\bigl[\nabla f(\dot x)\bigr]\cdot\Bigl[g'[\nabla f(x)]\odot\bigl(2g[\nabla f(x)]\bigr)\Bigr] \end{aligned}$$ If $H_f$ is the Hessian matrix of $f$, then we can see that $$ \dot\nabla (\nabla f(\dot x))\cdot y = H_fy, $$ hence $$ \nabla||\nabla f(x)||_1 = 2H_f\Bigl[g'[\nabla f(x)]\odot g[\nabla f(x)]\Bigr]. $$ Noting that $$ g'(a) = \frac{\mathrm d}{\mathrm da}\sqrt{|a|} = \frac{\mathrm{sign}(a)}{2\sqrt{|a|}} $$ where $\mathrm{sign}(a)$ is the sign (-1, 0, or 1) of $a$, we compute $$ g'(a)g(a) = \frac{\mathrm{sign}(a)}{2\sqrt{|a|}}\sqrt{|a|} = \frac12\mathrm{sign}(a), $$ and hence our final expression is $$ \nabla||\nabla f(x)||_1 = H_f\,\mathrm{sign}[\nabla f(x)]. $$
Let $f_i(x) = \frac{\partial}{\partial x_i} f(x)$ so that and similarly let $f_{ij}(x) = \frac{\partial^2}{\partial x_i \partial x_j}f(x)$. Also I will use $$ \text{sign}(z) = \begin{cases} -1 & z < 0 \\ 0 & z = 0 \\ 1 & z > 0 \end{cases}. $$ The gradient of a function $f$ is always given by \begin{equation} \nabla f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots\\ f_d(x)\end{bmatrix} \end{equation} and therefore \begin{equation} ||\nabla f(x)||_1 = \sum_{i=1}^d |f_i(x)| \\ \end{equation} Note that there are some differentiability issues if $f_i(x) = 0$ for any $i$ but I'm going to ignore those (they can be handled with sub-differentials if needed).
Recall that $$ \frac{\partial}{\partial z} |z| = \text{sign}(z) $$ (assuming you want to handle $0$ in the standard way).
Now, lets look at one partial derivative of norm. \begin{align} \frac{\partial }{\partial x_j} ||\nabla f(x)||_1 &= \frac{\partial f}{\partial x_j} \left ( \sum_{i=1}^d |f_i(x)| \right ) \\ &= \sum_{i=1}^d \frac{\partial }{\partial x_j} |f_i(x)| \\ &= \sum_{i=1}^d \text{sign}(f_i(x))\cdot \frac{\partial }{\partial x_j} f_i(x) \\ &= \sum_{i=1}^d \text{sign}(f_i(x))\cdot f_{ij}(x) \\ &= \sum_{i=1}^d f_{ji}(x) \text{sign}(f_i(x)) \end{align} The third line is just the chain rule. The reason I've done this last expression is because it looks a lot like a vector multiplication with the way the summation is being done over $i$. In paricular, If $H(x)$ is the Hessian matrix of $f$ at the point $x$, that is $H_{ji}(x) = f_{ji}(x)$, then the $j$th partial is the product of the $j$th row of the Hessian $H_j(x)$ and the signs of the gradient, $$ \frac{\partial }{\partial x_j} ||\nabla f(x)||_1 = \sum_{i=1}^d f_{ji}(x) \text{sign}(f_i(x)) = H_j(x) \cdot \text{sign}(\nabla f(x)) $$ where the sign operation is applied entrywise. From this it follows that the complete gradient is given by \begin{align} \nabla || \nabla f(x)||_1 = \begin{bmatrix} H_1(x) \cdot \text{sign}(\nabla f(x)) \\ H_2(x) \cdot \text{sign}(\nabla f(x)) \\ \vdots \\ H_d(x) \cdot \text{sign}(\nabla f(x)) \end{bmatrix} = \begin{bmatrix} H_1(x) \\ H_2(x) \\ \vdots \\ H_d(x) \end{bmatrix} \text{sign}(\nabla f(x)) = H(x) \text{sign}(\nabla f(x)) \end{align} It's worthwhile to check for yourself that this is indeed a vector of length $d$.