I was looking for an alternative way to activate each neuron of a neural network non-linearly. Eventually, I came up with the following binary operation: $$ x \lor y = \log (\exp x + \exp y) $$
This operator is associative and commutative, has $-\infty$ as its identity element, and behaves roughly like $\max$, so I call it smooth max. To give it learnable parameters, use it as $f(x_1,\cdots,x_n) = b_0 \lor (w_1x_1+b_1) \lor \cdots \lor (w_nx_n+b_n)$.
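As a sketch of how such a neuron could be computed (the function name and argument layout are my own assumptions, not from any library), NumPy's `logaddexp` evaluates $\log(e^a + e^b)$ and, being a ufunc, can be reduced over all the affine terms at once:

```python
import numpy as np

def smooth_max_neuron(x, w, b):
    """f(x_1, ..., x_n) = b_0 v (w_1 x_1 + b_1) v ... v (w_n x_n + b_n).

    x: inputs, shape (n,); w: weights, shape (n,); b: biases, shape (n+1,),
    where b[0] is the standalone bias b_0.
    """
    terms = np.concatenate(([b[0]], w * x + b[1:]))  # all n+1 operands
    return np.logaddexp.reduce(terms)                # fold with log(e^a + e^b)
```

Associativity is what makes the `reduce` valid: the fold can group the operands in any order and yield the same result.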
I think this is a good way to activate the neurons in the next layer, for the following reasons:
This operator is smooth.
Since $\frac{\partial}{\partial x}(x \lor y) = (1 + \exp(y-x))^{-1}$ and $\frac{\partial}{\partial y}(x \lor y) = (\exp(x-y) + 1)^{-1}$, the two partial derivatives sum to $1$, so at least one operand always receives a gradient of at least $1/2$ during backpropagation. That is, there is no vanishing-gradient problem.
It provides some degree of non-linearity on its own, so it doesn't need a separate activation function.
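The gradient claim above can be checked numerically (a quick sketch; the function name is mine):

```python
import math

def smooth_max_grads(x, y):
    """Partial derivatives of x v y = log(exp(x) + exp(y))."""
    gx = 1.0 / (1.0 + math.exp(y - x))  # d/dx
    gy = 1.0 / (math.exp(x - y) + 1.0)  # d/dy
    return gx, gy
```

For any finite inputs, `gx + gy == 1`, so the larger of the two is always at least `0.5`.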
For experimental purposes, I plan to actually use this operator in neural networks I've built. However, I also see a problem with this operator: directly invoking the exp and log functions that most programming languages provide is very prone to floating-point overflow and loss of precision. So how should I compute this operator safely?
Without loss of generality assume $x \ge y$. Then $$ x \lor y = x + \log(1+\exp(y-x)). $$ Since $y - x \le 0$, the exponential cannot overflow, and the correction term $\log(1+\exp(y-x))$ lies in $[0, \log 2]$; for extra accuracy near $0$, use a log1p routine where available. This should already fix the biggest problems.
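A minimal sketch of this trick (the function name is my own choice):

```python
import math

def smooth_max(x, y):
    """Numerically stable x v y = log(exp(x) + exp(y))."""
    if x < y:
        x, y = y, x  # ensure x >= y, so exp(y - x) <= 1
    if y == -math.inf:
        return x     # identity element: -inf v x = x (also covers x = y = -inf)
    return x + math.log1p(math.exp(y - x))  # log1p(t) = log(1 + t), accurate for small t
```

The explicit $-\infty$ check avoids evaluating `exp(-inf - (-inf))`, which would produce a NaN when both arguments are the identity element.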