Minimizing a neural network containing relu


I have a question regarding the minimization of a neural network containing $\mathrm{ReLU}(x) := \max\{0, x\}$.

Since ReLU is convex, the subgradient method can be used to minimize the ReLU function itself.
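For instance, the subdifferential of ReLU at $0$ is the whole interval $[0, 1]$. A minimal NumPy sketch of one valid subgradient choice (the 0-at-the-kink convention commonly used by deep-learning frameworks):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_subgrad(x):
    # One valid subgradient: 1 for x > 0, 0 for x <= 0.
    # At x = 0 any value in [0, 1] is a subgradient; picking 0
    # is the common convention in deep-learning frameworks.
    return (x > 0).astype(float)

print(relu(np.array([-2.0, 0.0, 3.0])))          # [0. 0. 3.]
print(relu_subgrad(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
```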

However, ReLU is part of a neural network $f$. Training a neural network means minimizing a function $L(W)= \sum_{i = 1}^{N} d(f(x_{i},W),g_{i})$, where we have $N$ training samples $x_{i}$ with corresponding labels $g_{i}$, and $d$ is a metric.
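To make the setup concrete, here is a minimal sketch, assuming a hypothetical one-hidden-layer ReLU network for $f$ and the squared Euclidean distance for $d$ (both are illustrative choices, not part of the question):

```python
import numpy as np

def f(x, W):
    # Hypothetical one-hidden-layer ReLU network; W = (W1, W2).
    W1, W2 = W
    return W2 @ np.maximum(0.0, W1 @ x)

def L(W, xs, gs):
    # Squared Euclidean distance plays the role of the metric d.
    return sum(np.sum((f(x, W) - g) ** 2) for x, g in zip(xs, gs))

W = (np.array([[1.0]]), np.array([[2.0]]))
print(L(W, [np.array([3.0])], [np.array([5.0])]))  # (2*relu(3) - 5)^2 = 1.0
```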

How do we minimize $L$ with respect to $W$ when ReLU is non-differentiable and $f$ is non-convex?

In more detail, there are two subsequent questions:

  1. How is this optimized in practice?
  2. What are the theoretical aspects of this question? In which cases is convergence to a local optimum guaranteed (perhaps only almost surely)?
There is 1 answer below.


Your observation is correct: the subgradient method is only guaranteed to converge to a minimum for convex functions, so there is a caveat here. In practice, however, this is not a problem, for several reasons. First, since the problem is non-convex, we are looking for a local minimum rather than a global one to begin with. There remains a theoretical question about the convergence of the algorithm in general (we want it to terminate).

Recall that in practice, Stochastic Gradient Descent (not the subgradient method) is used to optimize neural networks. Using Gradient Descent on a non-differentiable function can indeed cause the algorithm to cycle, but this problem is mostly tied to the choice of step size, and Stochastic Gradient Descent uses a different step-size strategy that makes cycling less likely. Moreover, the shuffling introduced by the use of mini-batches further reduces the chance of this happening.
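As a concrete illustration, here is a toy NumPy sketch of mini-batch SGD on a small ReLU network (the task of fitting $y = |x|$, the layer sizes, and the step size are all assumptions for illustration, not any particular library's implementation). Note the `(z > 0)` mask in the backward pass, which is exactly the 0-at-the-kink subgradient convention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (assumed for illustration): learn y = |x|.
X = rng.uniform(-1.0, 1.0, size=(256, 1))
Y = np.abs(X)

# One-hidden-layer ReLU network.
W1 = rng.normal(scale=0.5, size=(8, 1))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(1, 8))

def mse():
    pred = np.maximum(0.0, X @ W1.T + b1) @ W2.T
    return float(np.mean((pred - Y) ** 2))

loss0 = mse()
lr = 0.1
for epoch in range(200):
    perm = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), 32):      # mini-batches of 32
        idx = perm[start:start + 32]
        x, y = X[idx], Y[idx]
        z = x @ W1.T + b1                   # pre-activations
        h = np.maximum(0.0, z)              # ReLU forward
        err = h @ W2.T - y
        # Backward pass: (z > 0) selects the 0-subgradient at the kink.
        gW2 = err.T @ h / len(idx)
        dz = (err @ W2) * (z > 0)
        gW1 = dz.T @ x / len(idx)
        gb1 = dz.mean(axis=0)
        W2 -= lr * gW2
        W1 -= lr * gW1
        b1 -= lr * gb1

loss = mse()
print(loss0, loss)  # the loss drops well below its initial value
```

Despite the non-differentiability and non-convexity, the loss decreases steadily; the kink set has measure zero and is rarely hit exactly.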

To conclude, neural networks do have issues with convergence and the consistency of results. In practice these are largely "ignored": the algorithms do converge, but not to an optimal solution in the sense of the subgradient method under appropriate assumptions (which include convexity).