role of the identity matrix in gradient of negative log likelihood loss function


The negative log-likelihood loss function for logistic regression with $\ell_2$ regularization is: $L(\beta) = -\sum^{n}_{i=1} \log P(y_i \mid x_i) + \lambda \|\beta\|^2$

And its gradient is: $\nabla L(\beta) = \frac{\partial L(\beta)}{\partial \beta} = \sum^{n}_{i=1}(p_i - y_i)\phi(x_i)+2\lambda\textbf{I}\beta$
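As a sanity check, this gradient formula can be verified numerically against finite differences. The sketch below assumes $\phi(x_i) = x_i$ (the post leaves the feature map abstract) and $p_i = \sigma(\beta^T \phi(x_i))$ with $\sigma$ the logistic sigmoid; the data is synthetic.

```python
import numpy as np

# Sketch: check the gradient  sum_i (p_i - y_i) phi(x_i) + 2*lam*beta
# against a central finite-difference approximation of the loss.
# Assumption: phi(x_i) = x_i and p_i = sigmoid(beta . x_i).

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))                      # rows are phi(x_i)
y = rng.integers(0, 2, size=n).astype(float)     # binary labels
beta = rng.normal(size=d)
lam = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(b):
    # negative log-likelihood + lambda * ||beta||^2
    p = sigmoid(X @ b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + lam * b @ b

def grad(b):
    p = sigmoid(X @ b)
    # note: 2*lam*I @ b is just 2*lam*b as a vector
    return X.T @ (p - y) + 2 * lam * b

# central finite differences along each coordinate direction
eps = 1e-6
num = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                for e in np.eye(d)])
assert np.allclose(num, grad(beta), atol=1e-4)
```

The assertion passing confirms that, at the level of the gradient vector itself, $\lambda\mathbf{I}\beta$ and the scalar product $\lambda\beta$ are indeed the same thing, which is exactly what the question observes.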

I am trying to understand what the identity matrix does there. So far I thought that multiplying the identity matrix by a scalar gives you that scalar "embedded" diagonally in the matrix. But isn't elementwise multiplication by a scalar the same thing as multiplication by such a matrix? What is the reason for including it in this equation, then?


BEST ANSWER

This is ridge (i.e., $\ell_2$-regularized) logistic regression. Let us look at the linear case first, where the target function is $$ \arg \min_{\beta}\left( \|\mathbf{y - X\beta}\|^2_2+\lambda\|\beta\|_2^2 \right) $$ and its gradient is given by $$ -2\mathbf{X^T(y-X\beta)} + 2\lambda \mathbf{I}\beta\propto \mathbf{X^TX\beta - X ^ T y}+\lambda \mathbf{I} \beta=0, $$ hence the ridge estimator is $$ \hat{\beta} = \mathbf{( X ^ T X + \lambda I)^{-1}X^Ty }. $$ You are right that $\lambda\mathbf{I}\beta = \lambda\beta$ as a vector. The identity matrix matters once you collect the terms in $\beta$: since $\lambda$ is a scalar and $\mathbf{X^TX}$ is a $p\times p$ matrix, the sum $$ \mathbf{X ^ T X + \lambda I} $$ is only well defined with the $p\times p$ identity matrix; $\lambda\mathbf{I}$ adds $\lambda$ to every diagonal entry of $\mathbf{X^TX}$ (equivalently, shifts each of its eigenvalues by $\lambda$). The same logic holds for logistic regression with the ML estimators: just replace $\mathbf{X\beta}$ in the gradient with the vector $\pi$, the expected value of $\mathbf{y}$ in the logistic model.
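The linear-case derivation above can be checked numerically: form $\mathbf{X^TX} + \lambda\mathbf{I}$, solve for $\hat\beta$, and verify that the gradient vanishes there. The data below is synthetic and illustrative only.

```python
import numpy as np

# Sketch: the ridge closed form  beta_hat = (X^T X + lam*I)^{-1} X^T y.
# lam alone is a scalar; adding it to the p x p matrix X^T X only makes
# sense as lam * I, which adds lam to each diagonal entry.

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 0.5

# solve (X^T X + lam*I) beta = X^T y rather than forming an inverse
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# check: beta_hat zeroes the gradient  -2 X^T (y - X beta) + 2 lam beta
g = -2 * X.T @ (y - X @ beta_hat) + 2 * lam * beta_hat
assert np.allclose(g, np.zeros(p), atol=1e-8)
```

Using `np.linalg.solve` instead of an explicit inverse is the standard numerically stable choice; the added $\lambda\mathbf{I}$ also guarantees the system is invertible even when $\mathbf{X^TX}$ is singular.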