subdifferential of ReLU function composition with affine function


Let $f:\mathbb{R}^n\to\mathbb{R}^n$ be the ReLU function, i.e., $f(x)=[\max(0,x_i)]_{i=1}^n$. Let $g:\mathbb{R}^m \to \mathbb{R}^n$ be the affine function $g(x)=Ax+b$. What is the subdifferential of $f(g(x))$?


BEST ANSWER

The subdifferential of the composed function $h(x) := f(Ax+b)$ is simply $\partial h(x) = A^\top \partial f(Ax+b)$. Thus it suffices to find $\partial f(x)$.
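As a quick numerical sanity check of this chain rule (a NumPy sketch; the dimensions, the random data, and the choice of subgradient at kinks are all arbitrary), one can verify the per-component subgradient inequality $h_i(y) \ge h_i(x) + s_i^\top(y-x)$, where the candidate subgradients $s_i$ are the columns of $A^\top \mathrm{diag}(g)$ with $g \in \partial f(Ax+b)$ elementwise:

```python
import numpy as np

# Sketch: each component h_i(x) = max(0, a_i^T x + b_i) is convex, and
# A^T diag(g), with g an elementwise subgradient of ReLU at Ax+b, collects
# valid subgradients of the components h_i in its columns.
rng = np.random.default_rng(0)
m, n = 4, 3                          # g : R^m -> R^n, so A is n x m
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)

z = A @ x + b                        # g(x)
g = (z > 0).astype(float)            # one valid elementwise subgradient of f at z
S = A.T * g                          # S[:, i] = g_i * a_i, a subgradient of h_i

hx = np.maximum(0, z)
for _ in range(1000):                # subgradient inequality at random test points
    y = x + rng.standard_normal(m)
    hy = np.maximum(0, A @ y + b)
    assert np.all(hy + 1e-9 >= hx + S.T @ (y - x))
```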

For the one-dimensional ReLU, $r(z) = \max(0,z)$ with $z\in\mathbb{R}$, it is straightforward to find the subdifferential (with a slight abuse of notation: a single element stands for the singleton set containing it): $$\partial r(z) = \begin{cases} 1 &\mbox{if } z > 0,\\ [0,1] &\mbox{if } z=0,\\ 0 &\mbox{if } z < 0. \end{cases}$$ Sketch of proof: $r$ is differentiable everywhere except at $z=0$, thus $\partial r (z) = r'(z) \ \forall z \neq 0$. For $z=0$ it is easy to show from the definition of a subgradient that $\partial r (0) = [0,1]$.
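In code, the three cases read as follows (a minimal Python sketch; the function name and the interval representation are my own), together with a brute-force check of the subgradient inequality $r(y) \ge r(0) + t\,y$ at the kink:

```python
import numpy as np

def relu_subdiff_1d(z):
    """Subdifferential of r(z) = max(0, z), returned as an interval (lo, hi)."""
    if z > 0:
        return (1.0, 1.0)            # singleton {1}
    if z < 0:
        return (0.0, 0.0)            # singleton {0}
    return (0.0, 1.0)                # the whole interval [0, 1] at z = 0

# every slope t in [0, 1] satisfies r(y) >= r(0) + t*(y - 0) for all y
for t in np.linspace(0.0, 1.0, 11):
    for y in np.linspace(-5.0, 5.0, 101):
        assert max(0.0, y) >= t * y - 1e-12
```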

For the general $n$-dimensional ReLU ($\mathbb{R}^n\to \mathbb{R}^n$), $f(x) = \max(0,x)$ with $x\in \mathbb{R}^n$, and for vector-valued functions in general, there are different ways of generalizing the notion of subgradient. One can use, for example, generalized inequalities based on convex cones (see this paper for example). Another solution is to use the Clarke Jacobian (which is the Clarke subdifferential for vector-valued functions). For the ReLU function, it can be shown that these two kinds of "subgradients" coincide (if one chooses the nonnegative orthant as the underlying cone in the first kind). I'll give a detailed derivation for the Clarke subdifferential because it's more common.

The Clarke subdifferential is the set of Clarke Jacobian matrices, defined as follows for a locally Lipschitz function $F:\mathbb{R}^n\to \mathbb{R}^m$: \begin{equation} D_CF(x) = \mathrm{co}\left\{\lim_{i\to\infty}DF(x_i): x_i\to x, x_i\not\in E\cup E_f \right\},\tag{*} \end{equation} where $E\subset \mathbb{R}^n$ is any set of measure zero, $E_f$ is the set of points at which $F$ is not differentiable, $DF$ is the Jacobian matrix, and $\mathrm{co}$ denotes the convex hull. In other words: take any sequence $x_i$ converging to $x$ while avoiding both $E$ and $E_f$, such that $DF(x_i)$ converges to a limit; then the convex hull of all such limits is $D_CF(x)$. Clearly, $D_CF(x)$ is a set of $m\times n$ matrices, and it is well-known that if $F$ is differentiable at $x$, then $D_CF(x) = \{DF(x)\}$.

Now let's compute $D_Cf$ for our ReLU function $f$. It is easy to see that $f$ is differentiable at any $x = (x_1,\dots,x_n)$ such that $x_i\neq 0 \ \forall i$, and the Jacobian at such a point can be computed as \begin{equation} \frac{\partial f_i}{\partial x_j}(x) = \begin{cases} 0 &\mbox{if } i\neq j \mbox{ or } x_j < 0,\\ 1 &\mbox{otherwise}. \end{cases} \end{equation} (The Jacobian is thus a diagonal matrix whose diagonal elements are either $0$ or $1$.) Clearly, we can partition $\mathbb{R}^n$ into $2^n$ subsets such that $f$ is differentiable on their interiors, and that $Df$ is constant over each subset. These $2^n$ subsets are precisely the $2^n$ orthants of $\mathbb{R}^n$, and the $2^n$ Jacobians are the $2^n$ diagonal matrices whose diagonals are from $\{0,1\}^n$. Now, for a point $x$ on the boundaries of the orthants (at which $f$ is clearly not differentiable), we can compute $D_Cf(x)$ according to $(*)$ as follows.
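At a point with no zero coordinates this diagonal Jacobian can be checked against finite differences (a NumPy sketch; the test point and step size are arbitrary):

```python
import numpy as np

def relu_jacobian(x):
    """Jacobian of f(x) = max(0, x) at a point with no zero entries."""
    assert np.all(x != 0), "f is not differentiable where some x_i == 0"
    return np.diag((x > 0).astype(float))

x = np.array([1.5, -2.0, 0.3])       # interior of an orthant: differentiable
J = relu_jacobian(x)

# central-difference check, column by column
eps = 1e-6
num = np.empty((3, 3))
for j in range(3):
    e = np.zeros(3); e[j] = eps
    num[:, j] = (np.maximum(0, x + e) - np.maximum(0, x - e)) / (2 * eps)
assert np.allclose(J, num)
```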

Suppose that $x$ lies in the intersection of $k$ orthants $S_1,\dots,S_k$, and that these orthants correspond to the (constant) Jacobians $G_1,\dots,G_k$. For each sequence $x_i$ that converges to $x$ (such that $x_i$ does not meet the boundaries), if $i$ is large enough then obviously $Df(x_i)$ can only take one of the values $G_1,\dots,G_k$, i.e., the sequence $Df(x_i)$ admits cluster points in the set $\{ G_1,\dots,G_k \}$. On the other hand, if we choose the sequence $x_i$ such that all of its elements belong to $S_j$ ($1\le j \le k$), then automatically $Df(x_i)$ converges to $G_j$ because $Df(x_i) = G_j \ \forall i$ (i.e., a constant sequence). We conclude that the set of possible limits in $(*)$ is exactly $\{ G_1,\dots,G_k \}$ and therefore $D_Cf(x) = \mathrm{co} \{ G_1,\dots,G_k \}$.

It remains to determine $\{ G_1,\dots,G_k \}$. Because $x$ is on the boundaries, $x$ must contain at least one zero element. Suppose that $x$ contains exactly $p$ zero elements; then $x$ lies in the intersection of the $k = 2^{p}$ orthants corresponding to the $2^{p}$ Jacobians of the form $G=\mathrm{diag}(g_1,\dots,g_n)$, where \begin{equation} g_i = \begin{cases} 0 &\mbox{if } x_i < 0,\\ 1 &\mbox{if } x_i > 0,\\ \mbox{either } 0 \mbox{ or } 1 &\mbox{if } x_i = 0. \end{cases} \end{equation} The convex hull of these $2^{p}$ Jacobians is clearly the set of matrices $\mathrm{diag}(c_1,\dots,c_n)$ such that \begin{equation} c_i = \begin{cases} 0 &\mbox{if } x_i < 0,\\ 1 &\mbox{if } x_i > 0,\\ \mbox{any value in } [0,1] &\mbox{if } x_i = 0. \end{cases} \end{equation} For example, in $\mathbb{R}^4$, the point $x = (0, \color{blue}{2}, \color{red}{-7}, 0)$ lies in the intersection of the $4$ orthants whose Jacobians are $\mathrm{diag}(0, \color{blue}{1}, \color{red}{0}, 0)$, $\mathrm{diag}(0, \color{blue}{1}, \color{red}{0}, 1)$, $\mathrm{diag}(1, \color{blue}{1}, \color{red}{0}, 0)$, and $\mathrm{diag}(1, \color{blue}{1}, \color{red}{0}, 1)$, and thus \begin{align} D_Cf(x) &= \mathrm{co} \left\{\mathrm{diag}(0, \color{blue}{1}, \color{red}{0}, 0), \mathrm{diag}(0, \color{blue}{1}, \color{red}{0}, 1), \mathrm{diag}(1, \color{blue}{1}, \color{red}{0}, 0), \mathrm{diag}(1, \color{blue}{1}, \color{red}{0}, 1)\right\} \\ &= \left\{\mathrm{diag}(c_1, \color{blue}{1}, \color{red}{0}, c_4): c_1\in [0,1], c_4\in [0,1]\right\}. \end{align}
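The enumeration of the $2^p$ vertex Jacobians can be scripted directly (a sketch; `clarke_vertices` is my own name for the helper), reproducing the four matrices of the worked example:

```python
import itertools
import numpy as np

def clarke_vertices(x):
    """The 2^p vertex Jacobians whose convex hull is D_C f(x) for n-dim ReLU."""
    x = np.asarray(x, dtype=float)
    zero = (x == 0)
    fixed = (x > 0).astype(float)      # 1 where x_i > 0, 0 where x_i < 0
    verts = []
    for bits in itertools.product([0.0, 1.0], repeat=int(zero.sum())):
        g = fixed.copy()
        g[zero] = bits                 # free 0/1 choice at each zero coordinate
        verts.append(np.diag(g))
    return verts

V = clarke_vertices([0.0, 2.0, -7.0, 0.0])
# four vertices, matching the worked example above
assert len(V) == 4
assert {tuple(np.diag(M)) for M in V} == {(0, 1, 0, 0), (0, 1, 0, 1),
                                          (1, 1, 0, 0), (1, 1, 0, 1)}
```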

All the above just to arrive at the following simple rule for computing the subdifferential of the $n$-dimensional ReLU function $f$ at a point $x=(x_1,\dots,x_n)$:

  • Compute $g_i \in \partial \mathrm{ReLU}(x_i)$ (i.e., element-wise subgradient).
  • Then $\mathrm{diag}(g_1,\dots,g_n)$ is a Clarke Jacobian of $f$ at $x$.
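This rule condenses to a few lines of code (a Python sketch; the helper name and the default choice of $1/2$ at zero coordinates are mine):

```python
import numpy as np

def relu_clarke_element(x, c=None):
    """One element of the Clarke subdifferential of the n-dim ReLU at x.

    c (entries in [0, 1], same length as x) picks the subgradient value
    wherever x_i == 0; it defaults to 0.5 at those coordinates.
    """
    x = np.asarray(x, dtype=float)
    g = (x > 0).astype(float)          # 1 if x_i > 0, else 0
    at_zero = (x == 0)
    if c is None:
        g[at_zero] = 0.5
    else:
        g[at_zero] = np.asarray(c, dtype=float)[at_zero]
    return np.diag(g)
```

For the example point above, `relu_clarke_element([0, 2, -7, 0], c=[0.3, 0, 0, 0.9])` gives $\mathrm{diag}(0.3, 1, 0, 0.9)$, one point of the convex hull.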

If the reader finds the above derivation hard to follow, I recommend working it out in $\mathbb{R}^2$; the idea becomes clear quickly.


$ \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\step#1{\operatorname{\sigma}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\o{{\tt1}}\def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} $Let $\sigma$ denote the Heaviside step function (the derivative of the ReLU function): $$\step{z} = \begin{cases} \o &{\rm if}\;z>0, \\ 0 &{\rm otherwise.} \end{cases}$$ Define the vector variables $$g = Ax+b, \qquad f = \max(0,g), \qquad h = \step{g}$$ A diagonal matrix $\;H = \Diag{h}\;$ can replace an elementwise/Hadamard product: $$h\odot w = Hw$$ Now calculate the differential and gradient of $f$: $$\eqalign{ df &= h\odot dg \\&= h\odot\LR{A\,dx} \\&= HA\,dx \\ \grad{f}{x} &= HA \\ }$$ $\ldots$ or the transpose of this, depending on your preferred layout convention.
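The $HA$ formula is easy to confirm numerically at a point where no $g_i$ is exactly zero (a NumPy sketch with hand-picked data so that $g = (0.25, -1, 3.75)$ stays clear of the kinks):

```python
import numpy as np

A = np.array([[1.0, -2.0, 0.5, 1.0],
              [0.0, 1.0, -1.0, 2.0],
              [3.0, 0.0, 1.0, -1.0]])
b = np.array([0.5, -3.0, 2.0])
x = np.array([1.0, 0.5, -1.0, 0.25])

g = A @ x + b                          # here g = (0.25, -1.0, 3.75): no zeros
H = np.diag((g > 0).astype(float))     # Diag(step(g))
J = H @ A                              # gradient of f w.r.t. x (numerator layout)

# central-difference check of each column of the Jacobian
eps = 1e-6
num = np.empty_like(J)
for j in range(A.shape[1]):
    e = np.zeros(A.shape[1]); e[j] = eps
    num[:, j] = (np.maximum(0, A @ (x + e) + b)
                 - np.maximum(0, A @ (x - e) + b)) / (2 * eps)
assert np.allclose(J, num)
```

The row of $J$ corresponding to the negative component $g_2 = -1$ is identically zero, exactly as $H$ predicts.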