For the loss function of logistic regression $$ \ell = \sum_{i=1}^n \left[ y_i \boldsymbol{\beta}^T \mathbf{x}_{i} - \log \left(1 + \exp\left( \boldsymbol{\beta}^T \mathbf{x}_{i} \right)\right) \right] $$ I understand that its first-order derivative is $$ \frac{\partial \ell}{\partial \boldsymbol{\beta}} = \boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{p}) $$ where $$ \boldsymbol{p} = \frac{\exp(\boldsymbol{X} \boldsymbol{\beta})}{1 + \exp(\boldsymbol{X} \boldsymbol{\beta})} $$ (applied elementwise) and its second-order derivative is
$$ \frac{\partial^2 \ell}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}^T} = -\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X} $$ where $\boldsymbol{W}$ is an $n \times n$ diagonal matrix whose $i$-th diagonal element is $p_i(1-p_i)$. However, I am struggling with the first- and second-order derivatives of the loss function of logistic regression with L2 regularization
$$ \ell = \sum_{i=1}^n \left[ y_i \boldsymbol{\beta}^T \mathbf{x}_{i} - \log \left(1 + \exp\left( \boldsymbol{\beta}^T \mathbf{x}_{i} \right)\right) \right] + \lambda \sum_{j=1}^{p}\beta_j^2 $$
I tried to extrapolate $\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{p})$ and $-\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X}$ by simply adding one more term, according to my meager knowledge of calculus, making them $\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{p}) + 2\lambda\boldsymbol{\beta}$ and $-\boldsymbol{X}^T\boldsymbol{W}\boldsymbol{X} + 2\lambda$
But it appears to me that things do not work this way. So what are the correct first- and second-order derivatives of the loss function for logistic regression with L2 regularization?
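For concreteness, here is a minimal NumPy sketch of the unregularized quantities I do understand (the variable names and random data are mine, just for testing), with a finite-difference check of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
beta = rng.normal(size=d)

def loglik(b):
    z = X @ b
    # sum_i [ y_i * b'x_i - log(1 + exp(b'x_i)) ]
    return y @ z - np.sum(np.log1p(np.exp(z)))

p = 1.0 / (1.0 + np.exp(-X @ beta))      # p = exp(X beta) / (1 + exp(X beta))
grad = X.T @ (y - p)                     # X'(y - p)
W = np.diag(p * (1 - p))
hess = -X.T @ W @ X                      # -X' W X

# central finite differences agree with the closed-form gradient
eps = 1e-6
fd = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(grad, fd, atol=1e-5))  # True
```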
$\def\D{{\rm Diag}}\def\o{{\tt1}}\def\p#1#2{\frac{\partial #1}{\partial #2}}$You have expressions for a loss function and its derivatives (gradient, Hessian) $$\eqalign{ \ell &= y:X\beta - \o:\log\left(e^{X\beta}+\o\right) \\ g_{\ell} &= \p{\ell}{\beta} = X^T(y-p) \qquad&{\rm where}\;\;p = \sigma(X\beta) \\ H_{\ell} &= \p{g_{\ell}}{\beta} = -X^T\left(P-P^2\right)X \qquad&{\rm where}\;\,P = \D(p) \\ }$$ and now you want to add regularization. So let's do that $$\eqalign{ \mu &= \ell + \lambda\big\|\beta\big\|_F^2 \\ &= \ell + \lambda\beta:\beta \\ d\mu &= d\ell + 2\lambda\beta:d\beta \\ &= (g_{\ell}:d\beta) + (2\lambda\beta:d\beta) \\ &= (g_{\ell} + 2\lambda\beta):d\beta \\ g_\mu &= \p{\mu}{\beta} = g_{\ell} + 2\lambda\beta \\\\ dg_\mu &= dg_{\ell} + 2\lambda\,d\beta \\ &= H_{\ell}\,d\beta + 2\lambda I\,d\beta \\ &= \left(H_{\ell} + 2\lambda I\right)d\beta \\ H_\mu &= \p{g_\mu}{\beta} = H_\ell + 2\lambda I \\\\ }$$
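As a quick sanity check, a throwaway NumPy sketch (the data and names are made up, not part of the derivation) shows that these closed forms for $g_\mu$ and $H_\mu$ match finite differences of the regularized objective $\mu$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 40, 4, 0.5
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
beta = rng.normal(size=d)

def mu(b):                                 # regularized objective
    z = X @ b
    return y @ z - np.sum(np.log1p(np.exp(z))) + lam * (b @ b)

p = 1 / (1 + np.exp(-X @ beta))
g_ell = X.T @ (y - p)                      # X'(y - p)
H_ell = -X.T @ np.diag(p * (1 - p)) @ X    # -X'(P - P^2)X

g_mu = g_ell + 2 * lam * beta              # g_ell + 2*lambda*beta
H_mu = H_ell + 2 * lam * np.eye(d)         # H_ell + 2*lambda*I

eps, E = 1e-5, np.eye(d)
fd_g = np.array([(mu(beta + eps * e) - mu(beta - eps * e)) / (2 * eps) for e in E])
fd_H = np.array([[(mu(beta + eps * (ei + ej)) - mu(beta + eps * (ei - ej))
                   - mu(beta - eps * (ei - ej)) + mu(beta - eps * (ei + ej)))
                  / (4 * eps**2) for ej in E] for ei in E])
print(np.allclose(g_mu, fd_g, atol=1e-4))  # True
print(np.allclose(H_mu, fd_H, atol=1e-3))  # True
```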
In the above, a colon denotes the trace/Frobenius product, i.e. $$\eqalign{ A:B &= {\rm Tr}(A^TB) \\ A:A &= \big\|A\big\|_F^2 \\ }$$ When $(A,B)$ are vectors, this definition reduces to the standard dot product.
The Frobenius product inherits nice algebraic properties from the trace function, e.g. $$\eqalign{ A:B &= B:A = B^T:A^T \\ CA:B &= C:BA^T = A:C^TB \\ }$$ It also has nice behavior under differentiation $$\eqalign{ d(A:B) &= dA:B + A:dB \\ d(A:A) &= dA:A + A:dA \\ &= A:dA + A:dA \\ &= 2A:dA \\ }$$
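These identities are easy to confirm numerically; here is a throwaway NumPy check (random matrices, names purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 3))

frob = lambda U, V: np.trace(U.T @ V)                       # A:B = Tr(A'B)

print(np.isclose(frob(A, B), frob(B, A)))                   # A:B = B:A
print(np.isclose(frob(A, A), np.linalg.norm(A, 'fro')**2))  # A:A = ||A||_F^2
print(np.isclose(frob(C @ A, B), frob(C, B @ A.T)))         # CA:B = C:BA'
print(np.isclose(frob(C @ A, B), frob(A, C.T @ B)))         # CA:B = A:C'B
```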