Our equation for the negative log-likelihood loss for logistic regression with an L2 penalty (maximum likelihood plus regularization) is: $L(\beta) = -\sum^{n}_{i=1} \log P(y_i \mid x_i) + \lambda \|\beta\|^2$
And its gradient is: $\nabla L(\beta) = \frac{\partial L(\beta)}{\partial \beta} = \sum^{n}_{i=1}(p_i - y_i)\phi(x_i)+2\lambda\mathbf{I}\beta$
I am trying to understand what the identity matrix does there. So far my understanding is: multiplying the identity matrix by a scalar gives you that scalar "embedded" along the diagonal of the matrix. But isn't elementwise multiplication of a vector by a scalar the same thing as multiplication by such a matrix? What is the reason for including it in this equation, then?
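To make my question concrete, here is a quick NumPy check (with made-up values for `lam` and `beta`) showing that multiplying by $\lambda\mathbf{I}$ and plain scalar multiplication give the same vector:

```python
import numpy as np

lam = 0.5                          # regularization strength (made-up value)
beta = np.array([1.0, -2.0, 3.0])  # made-up coefficient vector

# lam * I embeds the scalar lam along the diagonal of a 3x3 matrix...
scaled_identity = lam * np.eye(3)

# ...and multiplying beta by that matrix gives the same result
# as ordinary scalar multiplication.
via_matrix = scaled_identity @ beta
via_scalar = lam * beta

print(np.allclose(via_matrix, via_scalar))  # True
```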
This is ridge logistic regression. Let us look at the linear case first; the target function is $$ \arg \min_{\beta}\left( \|\mathbf{y - X\beta}\|^2_2+\lambda\|\beta\|_2^2 \right) $$ and setting its gradient to zero gives $$ -2\mathbf{X^T(y-X\beta)} + 2\lambda \mathbf{I}\beta = 0 \quad\Longleftrightarrow\quad \mathbf{X ^ T y - X^TX\beta}-\lambda \mathbf{I} \beta=\mathbf{0}, $$ hence the ridge estimator is $$ \hat{\beta} = \mathbf{( X ^ T X + \lambda I)^{-1}X^Ty }. $$ Here the identity matrix is essential: without the $p\times p$ identity matrix, the sum $$ \mathbf{X ^ T X + \lambda I} $$ is not well defined, because you cannot add a scalar to a matrix. The same logic holds for logistic regression with the ML estimator: just replace $\mathbf{X\beta}$ in the gradient with the vector $\pi$, the expected value of $\mathbf{y}$ under the logistic model.
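A small NumPy sketch of the linear case (with made-up random data) illustrates both points: $\lambda\mathbf{I}$ simply adds $\lambda$ to each diagonal entry of $\mathbf{X^TX}$, making the $p\times p$ sum well defined, and the resulting closed-form estimator zeroes the gradient above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))   # made-up design matrix
y = rng.normal(size=n)        # made-up response
lam = 2.0                     # made-up regularization strength

# Ridge estimator: beta_hat = (X^T X + lam * I)^{-1} X^T y.
# lam * np.eye(p) adds lam to the diagonal of the p x p matrix X^T X;
# a bare scalar lam could not be added to a matrix.
A = X.T @ X + lam * np.eye(p)
beta_hat = np.linalg.solve(A, X.T @ y)

# The gradient -2 X^T (y - X beta) + 2 lam beta vanishes at beta_hat.
grad = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
print(np.allclose(grad, 0))  # True
```

Using `np.linalg.solve` rather than explicitly inverting $\mathbf{X^TX + \lambda I}$ is the numerically preferred way to evaluate this estimator.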