I am trying to find the $L$-smoothness constant of the following function (logistic regression cost function) in order to run gradient descent with an appropriate stepsize.
The function is given as $f(x)=-\frac{1}{m} \sum_{i=1}^m\left(y_i \log \left(s\left(a_i^{\top} x\right)\right)+\left(1-y_i\right) \log \left(1-s\left(a_i^{\top} x\right)\right)\right)+\frac{\gamma}{2}\|x\|^2$ where $a_i \in \mathbb{R}^n, y_i \in\{0,1\}$,$s(z)=\frac{1}{1+\exp (-z)}$ is the sigmoid function.
The gradient is given as $\nabla f(x)=\frac{1}{m} \sum_{i=1}^m a_i\left(s\left(a_i^{\top} x\right)-y_i\right)+\gamma x $.
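For concreteness, here is a small numerical sketch (with a hypothetical random problem instance, assuming the rows of a matrix $A$ are the $a_i$) that implements $f$ and the stated gradient and checks the gradient formula against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x, A, y, gamma):
    """Regularized logistic loss; rows of A are the a_i."""
    p = sigmoid(A @ x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)) + 0.5 * gamma * x @ x

def grad_f(x, A, y, gamma):
    """Gradient (1/m) * sum_i a_i (s(a_i^T x) - y_i) + gamma * x."""
    return A.T @ (sigmoid(A @ x) - y) / len(y) + gamma * x

# hypothetical random data, just to check the formulas
rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
y = rng.integers(0, 2, m).astype(float)
x = rng.standard_normal(n)
gamma = 0.1

# central finite differences should reproduce the analytic gradient
eps = 1e-6
e = np.eye(n)
g = grad_f(x, A, y, gamma)
g_fd = np.array([(f(x + eps * e[j], A, y, gamma)
                  - f(x - eps * e[j], A, y, gamma)) / (2 * eps)
                 for j in range(n)])
assert np.allclose(g, g_fd, atol=1e-5)
```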
My idea was that the smoothness constant $L$ has to be at least as large as every eigenvalue of the Hessian of the given function. This follows from the fact that if $f$ is $L$-smooth, then $g(x)=\frac{L}{2} x^{\top} x-f(x)$ is a convex function, and therefore its Hessian $L I - \nabla^2 f(x)$ has to be positive semi-definite. The second-order partial derivatives of $f$ are given as
$ \frac{\partial^2 }{\partial x_k \partial x_j}f(x)=\frac{1}{m} \sum_{i=1}^m s(a_i^{\top} x)\left(1-s(a_i^{\top} x)\right)[a_i]_k[a_i]_j+\gamma\delta_{kj} $
From the following GitHub notebook (https://github.com/ymalitsky/adaptive_GD/blob/master/logistic_regression.ipynb) I know that $L=\frac{1}{4} \lambda_{\max }\left(A^{\top} A\right)+\gamma$, where $\lambda_{\max }$ denotes the largest eigenvalue. This seems plausible, since I figured out that $s(a_i^{\top} x)\left(1-s(a_i^{\top} x)\right)\leq \frac{1}{4}$ for all $x$.
But I am not able to fit everything together. I would appreciate any help.
Here's my idea:
Given the Hessian matrix (in your notation, with labels $y_i \in \{0,1\}$): \begin{equation} \begin{aligned} \nabla^2 f(x) &= \frac{1}{m}\sum_{i=1}^{m}s(a_i^{\top}x)\left(1-s(a_i^{\top}x)\right)a_ia_i^{\top} + \gamma I_n. \end{aligned} \end{equation}
To show $f$ is $L$-smooth, we apply Theorem 5.12 in [1, p. 114]: if $\|\nabla^2 f(x)\|_2\leq L$ for all $x\in\mathbb{R}^n$, then $f$ is $L$-smooth. We then bound the Hessian norm: \begin{equation} \begin{aligned} \|\nabla^2 f(x)\|_2 &\leq \left\|\frac{1}{m}\sum_{i=1}^{m}\frac{1}{4}a_ia_i^{\top}\right\|_2 + \gamma & \text{by } s(t)(1-s(t))\leq \tfrac{1}{4}\ \forall t\in\mathbb{R},\ a_ia_i^{\top}\succeq 0, \text{ and the triangle inequality}\\ &= \frac{1}{4m}\|A^{\top}A\|_2 + \gamma & \text{by } \textstyle\sum_i a_ia_i^{\top}=A^{\top}A\\ &= \frac{\lambda_{\max}(A^{\top}A)}{4m} + \gamma & \text{by } A^{\top}A \text{ symmetric PSD, so } \|A^{\top}A\|_2=\lambda_{\max}(A^{\top}A)\\ &\leq\frac{\lambda_{\max}(A^{\top}A)}{4}+\gamma & \text{by } m\geq 1. \end{aligned} \end{equation} Therefore, $f$ is $L$-smooth with $L=\frac{\lambda_{\max}(A^{\top}A)}{4}+\gamma$; note that the derivation actually yields the tighter constant $L=\frac{\lambda_{\max}(A^{\top}A)}{4m}+\gamma$, which allows a larger stepsize. $\blacksquare$
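As a sanity check (again with hypothetical random data, and assuming the rows of $A$ are the $a_i$), one can verify numerically that the spectral norm of the Hessian at random points never exceeds the tighter constant $\lambda_{\max}(A^{\top}A)/(4m)+\gamma$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical random data; rows of A are the a_i
rng = np.random.default_rng(1)
m, n = 50, 8
A = rng.standard_normal((m, n))
gamma = 0.05

# the tighter smoothness constant lambda_max(A^T A) / (4m) + gamma
# (eigvalsh returns eigenvalues in ascending order, so [-1] is the largest)
L = np.linalg.eigvalsh(A.T @ A)[-1] / (4 * m) + gamma

# spectral norm of the Hessian at random points never exceeds L
for _ in range(100):
    x = rng.standard_normal(n)
    s = sigmoid(A @ x)
    # Hessian: (1/m) * sum_i s_i (1 - s_i) a_i a_i^T + gamma * I
    H = (A.T * (s * (1 - s))) @ A / m + gamma * np.eye(n)
    assert np.linalg.eigvalsh(H)[-1] <= L + 1e-10
```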
[1] Beck, A. (2017). First-order methods in optimization. Philadelphia, PA: Society for Industrial and Applied Mathematics.