Are ReLU a very bad choice as last layer for regression?


Let's say that we have a very simple single-layer neural network, whose structure we can describe as follows: $$ a = Wx+b\\ o = \mathrm{ReLU}(a)\\ J = (t - o)^2 \\ \text{(online learning)} $$

When backpropagating the gradient to $W$ we have: $$ \frac{\partial J}{\partial W} = \frac{\partial J}{\partial o}\frac{\partial o}{\partial a} \frac{\partial a}{\partial W} $$ The second term is the derivative of the ReLU with respect to its preactivation, a matrix (in this case a $1 \times N$ vector) whose elements are in $\{0,1\}$. So if we have a single output neuron (here $o$) whose preactivation is always less than 0 (for example, because of a bad weight initialization), the network will never learn anything, because the gradient will always be 0 (since $\frac{\partial o}{\partial a} = 0$).
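A minimal sketch of this failure mode (my own illustration, not from any reference): a single linear unit with a ReLU output, trained online on a regression task with positive targets. The bias is deliberately initialized so negative that the preactivation $a = w \cdot x + b$ is below zero for every input, so the ReLU derivative is 0 and the weights never move:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))  # inputs in [0, 1]^3
t = X.sum(axis=1) + 1.0                   # regression targets, all > 0

w = rng.normal(size=3)
b = -10.0       # bad init: a = w @ x + b < 0 for every x in [0, 1]^3
lr = 0.1

w0, b0 = w.copy(), b
for x, target in zip(X, t):
    a = w @ x + b
    o = max(a, 0.0)                       # ReLU
    # dJ/da = dJ/do * do/da = 2*(o - target) * 1[a > 0]
    grad_a = 2.0 * (o - target) * (1.0 if a > 0 else 0.0)
    w -= lr * grad_a * x                  # dJ/dw = dJ/da * x
    b -= lr * grad_a                      # dJ/db = dJ/da

print(np.allclose(w, w0), b == b0)       # True True: no update ever happened
```

Every gradient step multiplies by the ReLU indicator $\mathbb{1}[a>0]$, which is 0 on every example, so the unit is "dead" from the start and the loss never decreases.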

I'm asking about regression because the target might always be $>0$, and somebody might pick ReLU as the output activation just because its range matches the target's domain.