I have read two papers (this and this) about the convergence of the Adam optimizer. One of the assumptions is the smoothness of the loss function, meaning that the gradient of the loss function is Lipschitz continuous. Consider a neural network $f(\theta)$ with a loss function $L$. If I want to prove the smoothness of the loss function, does that mean I have to derive the gradient of the loss with respect to every weight (i.e., work through the backpropagation equations) and show that the Lipschitz inequality holds?
Also, does this assumption depend on the network architecture?
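For context on what the assumption asks for: smoothness here means there is a constant $K$ with $\|\nabla L(\theta_1) - \nabla L(\theta_2)\| \le K \|\theta_1 - \theta_2\|$ for all parameter pairs. One cannot prove this numerically, but one can probe it empirically by sampling pairs of nearby parameter vectors and computing the ratio, which gives a lower bound on any valid $K$. Below is a minimal sketch of that probe, assuming a hypothetical one-hidden-layer tanh network with mean-squared loss (the architecture, sizes, and data are all my own illustrative choices, not from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 32 samples, 3 features, 1 target
X = rng.normal(size=(32, 3))
y = rng.normal(size=(32, 1))

def unpack(theta):
    """Split a flat parameter vector into the two weight matrices."""
    W1 = theta[:12].reshape(3, 4)    # input -> hidden
    W2 = theta[12:16].reshape(4, 1)  # hidden -> output
    return W1, W2

def grad(theta):
    """Analytic gradient of the MSE loss via backpropagation."""
    W1, W2 = unpack(theta)
    H = np.tanh(X @ W1)              # hidden activations
    out = H @ W2
    d_out = 2.0 * (out - y) / len(X)
    dW2 = H.T @ d_out
    dH = d_out @ W2.T
    dW1 = X.T @ (dH * (1.0 - H**2))  # tanh'(z) = 1 - tanh(z)^2
    return np.concatenate([dW1.ravel(), dW2.ravel()])

# Empirical lower bound on the gradient's Lipschitz constant:
# max over sampled pairs of ||grad(a) - grad(b)|| / ||a - b||
ratios = []
for _ in range(200):
    a = rng.normal(size=16)
    b = a + 1e-3 * rng.normal(size=16)
    ratios.append(np.linalg.norm(grad(a) - grad(b)) / np.linalg.norm(a - b))

print(f"empirical lower bound on K: {max(ratios):.4f}")
```

Note that finite ratios on sampled pairs do not prove smoothness; with a ReLU activation the gradient is discontinuous at the kinks, and even with tanh the products of weight matrices typically make the loss smooth only on bounded regions of parameter space, not globally.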