Assume for simplicity a neural network with a single parameter. Let $x \in R$ be a training pattern, $t \in R$ the target variable, $w \in R$ the parameter, and $g: R \rightarrow R$ the activation function. Given the regularized loss function:
$$ f(x;w) = \frac{1}{2}(t - g(xw))^2 + \frac{1}{2} \lambda w^2$$
The activation function $g$ can be linear, sigmoid, tanh or ReLU.
Depending on the choice of $g$, are $f$, $\nabla f$ and $\nabla^2 f$ (as functions of $w$) globally Lipschitz-continuous?
P.S.:
I need to verify the assumptions of some optimization algorithms, and I believe those assumptions require global Lipschitz continuity.
I answer my own question for the linear activation $g(z) = z$, using the theorem:
a differentiable $f: I \rightarrow R$ is Lipschitz on $I$ if and only if $\nabla f$ is bounded on $I$.
$$f(x;w) = \frac{1}{2} (t - wx)^2 + \frac{1}{2} \lambda w^2$$
Differentiating with respect to $w$:
$$\nabla f(x;w) = - tx + w(x^2 + \lambda) $$ $$\nabla^2 f(x;w) = x^2 + \lambda$$ $$\nabla^3 f(x;w) = 0$$
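As a sanity check, the derivatives above can be verified symbolically. A minimal sketch using SymPy, with symbols mirroring the ones in the derivation ($x$, $t$, $\lambda$ treated as fixed constants, $w$ as the variable):

```python
import sympy as sp

# w is the parameter we differentiate over; x, t, lambda are fixed constants.
w, x, t, lam = sp.symbols('w x t lambda', real=True)

# Regularized loss with the linear activation g(z) = z.
f = sp.Rational(1, 2) * (t - w * x) ** 2 + sp.Rational(1, 2) * lam * w ** 2

grad = sp.expand(sp.diff(f, w))   # equals -t*x + w*(x**2 + lambda)
hess = sp.diff(f, w, 2)           # equals x**2 + lambda
third = sp.diff(f, w, 3)          # equals 0

print(grad)
print(hess)
print(third)
```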
So $\nabla f$ is unbounded (it grows linearly in $w$), therefore $f$ is not globally Lipschitz. $\nabla^2 f$ is a constant (because $x$ and $\lambda$ are fixed), therefore $\nabla f$ is Lipschitz. Likewise $\nabla^2 f$ is Lipschitz because its derivative, $\nabla^3 f = 0$, is bounded.
I hope at least this much is correct, and that there is a way to check the other activations without computing everything explicitly.
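One way to check the other cases with the same machinery is to compute the derivatives symbolically and inspect which terms stay bounded in $w$. A sketch for the sigmoid, $g(z) = 1/(1+e^{-z})$ (chosen here only as an illustration; the point is that $\sigma$ and its derivatives are bounded, so every $w$-dependent term in $\nabla^2 f$ is bounded and only the constant $\lambda$ remains):

```python
import sympy as sp

w, x, t, lam = sp.symbols('w x t lambda', real=True)
sigma = lambda z: 1 / (1 + sp.exp(-z))  # sigmoid activation

# Regularized loss with sigmoid activation.
f = sp.Rational(1, 2) * (t - sigma(w * x)) ** 2 + sp.Rational(1, 2) * lam * w ** 2

grad = sp.simplify(sp.diff(f, w))   # -(t - sigma)*sigma'*x + lambda*w
hess = sp.simplify(sp.diff(f, w, 2))

print(grad)
print(hess)
```

Since $\sigma$, $\sigma'$ and $\sigma''$ are all bounded, $\nabla^2 f$ is bounded, so $\nabla f$ is globally Lipschitz; but the $\lambda w$ term still makes $\nabla f$ unbounded, so $f$ itself is again not globally Lipschitz.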