Hessian of the loss function (applying Newton's method in logistic regression)


If the cost function is $L$, $$ L = -\frac{1}{m}\Big( y\log(h(x)) + (1-y)\log(1-h(x)) \Big) $$ $$ h(x)=\frac{1}{1+e^{-(w^{T}x+b)}} $$ then the first-order partial derivative of $L$ with respect to $w$ is $$ \frac{\partial L}{\partial w} = \frac{1}{m}\,( h(x) - y )\,x $$
Question:
How do I find the second-order partial derivative of $L$ with respect to $w$, that is, $$ \frac{\partial ^{2}L}{\partial w^{2}}, $$
so that I can apply Newton's method and update the weights $w$ like this: $$ w_{new} = w_{old} - \left(\frac{\partial ^{2}L}{\partial w^{2}}\right)^{-1} \frac{\partial L}{\partial w} $$
I'm just trying to figure out how Newton's method works with logistic regression.
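As a quick numerical sanity check of the gradient formula above, here is a minimal sketch (NumPy; the toy data and all names are illustrative, and the bias $b$ is kept separate as in the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    # L = -(1/m) sum[ y log(h) + (1-y) log(1-h) ]
    h = sigmoid(X @ w + b)
    m = len(y)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(w, b, X, y):
    # (1/m) (h(x) - y) x, summed over the samples
    h = sigmoid(X @ w + b)
    m = len(y)
    return (1.0 / m) * X.T @ (h - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
w, b = rng.normal(size=3), 0.1

# Central-difference check of the first gradient component
eps = 1e-6
e0 = np.zeros(3); e0[0] = eps
numeric = (loss(w + e0, b, X, y) - loss(w - e0, b, X, y)) / (2 * eps)
print(numeric, grad(w, b, X, y)[0])   # the two values should agree closely
```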



BEST ANSWER

First of all, for Newton's method $f(x)$ has to be a map $$\mathbb R^{n} \to \mathbb R$$ that is twice differentiable and whose Hessian is positive semi-definite.

We already know from the question that
$$ \frac{\partial L} { \partial w} = (h_\theta(x) - y)\,x $$
(dropping the constant factor $\frac{1}{m}$), with
$$ \sigma(x) = \frac{1}{1+e^{-(w^Tx+b)}}. $$
The standard result for the first derivative of $\sigma(x)$ is
$$ \sigma'(x) = \sigma(x)(1-\sigma(x)). $$

Please note that here $ h_\theta(x) $ and $ \sigma(x) $ are one and the same; I just used $ \sigma(x)$ for representation's sake.

Now we need to find $ \frac{\partial^2 L}{ \partial w^2} $: $$ \begin{align*} \frac{\partial^2 L}{ \partial w^2} &= \frac{\partial}{\partial w}\big(x\,h_\theta(x) - x\,y\big) \\ \\ &= x\,\frac{\partial h_\theta(x)}{\partial w} \\ \\ &= x^2\,\sigma'(\cdot) \qquad\qquad [\,\text{chain rule: } \tfrac{\partial h_\theta}{\partial w} = \sigma'(\cdot)\,x\,] \\ \\ &= x^2 \big( \sigma(x) - \sigma(x)^2\big) \\ \\ &\quad \text{or, in vector form,} \\ \\ &= x\,\big( \sigma(x) - \sigma(x)^2\big)\,x^{T} \end{align*} $$ I'm hoping that I'm correct!
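To see this result in action, here is a minimal sketch of full Newton iterations using this Hessian, written per batch as $H = \frac{1}{m} X^T S X$ with $S = \operatorname{diag}(\sigma(x_i)(1-\sigma(x_i)))$. The synthetic data and all names are my own illustrative assumptions, and the bias is folded into $X$ as a constant column:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(w, X, y):
    """One Newton update for logistic regression (bias folded into X)."""
    m = len(y)
    h = sigmoid(X @ w)
    g = (1.0 / m) * X.T @ (h - y)      # gradient: (1/m) X^T (h - y)
    S = np.diag(h * (1 - h))           # sigma(x)(1 - sigma(x)) per sample
    H = (1.0 / m) * X.T @ S @ X        # Hessian: x . sigma(1-sigma) . x^T, summed
    return w - np.linalg.solve(H, g)   # w_new = w_old - H^{-1} g

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])  # last column = bias
true_w = np.array([1.5, -2.0, 0.5])
y = (sigmoid(X @ true_w) > rng.random(100)).astype(float)      # Bernoulli labels

w = np.zeros(3)
for _ in range(8):                     # Newton typically converges in a few steps
    w = newton_step(w, X, y)
print(w)                               # should be roughly in the direction of true_w
```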

ANSWER

It is not so clear that you have these concepts down.

You should clarify inputs and outputs. What you seem to have done is calculate the second derivative of a scalar-valued function of one variable, in other words an $$\mathbb R \to \mathbb R$$ function. The Jacobian collects all the partial derivatives with respect to all the different input variables. For a function $$\cases{x \in \mathbb R^n\\f(x) \in \mathbb R^m}$$

you get an output that is an $m\times n$ matrix (one row per output component, one column per input variable).

For the Hessian to be a matrix, we need the function $f(x)$ to be $$\mathbb R^{n} \to \mathbb R.$$

In the more general case $$\mathbb R^{n} \to \mathbb R^{m}$$ the second derivative will be a 3-index tensor.
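A small finite-difference sketch makes this shape bookkeeping concrete (the helper functions and the example $f$ are my own illustrative choices, not anything from the answer):

```python
import numpy as np

def jacobian(f, x, h=1e-5):
    """Finite-difference Jacobian of f: R^n -> R^m, shape (m, n)."""
    m, n = len(f(x)), len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

def second_derivative(f, x, h=1e-4):
    """Differentiate the Jacobian again: an (m, n, n) 3-index tensor."""
    n = len(x)
    slabs = []
    for k in range(n):
        e = np.zeros(n); e[k] = h
        slabs.append((jacobian(f, x + e) - jacobian(f, x - e)) / (2 * h))
    return np.stack(slabs, axis=-1)

f = lambda x: np.array([np.sum(x ** 2), np.prod(x)])  # R^3 -> R^2
x = np.array([1.0, 2.0, 3.0])

print(jacobian(f, x).shape)           # (2, 3): m x n Jacobian
print(second_derivative(f, x).shape)  # (2, 3, 3): 3-index tensor
```

For a scalar-valued $f$ ($m = 1$), the same tensor collapses to the familiar $n \times n$ Hessian matrix.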

ANSWER

For the second derivative, you could do it faster: $$\sigma' (x)= \sigma(x)(1-\sigma(x))=\sigma(x)-\sigma^2(x)$$ $$\sigma'' (x)=\sigma' (x)-2 \sigma (x)\sigma' (x)=\sigma' (x)(1-2\sigma (x))=\sigma(x)(1-\sigma(x))(1-2\sigma (x))$$ which is not what you obtained.

Edit

In order to check the result, let us use the second-order central difference approximation $$f''(x) \approx \frac{f(x+h) - 2 f(x) + f(x-h)}{h^{2}}$$ at $x=\frac 12$ and $h=\frac 1 {200}$.

This would give $-0.0575566$ while the formula I wrote gives $-0.0575568$; your formula leads to $0.292561$.
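Here is a short sketch reproducing this check, treating $\sigma$ as a function of $x$ alone, i.e. $\sigma(x) = 1/(1+e^{-x})$ (an assumption implied by the evaluation at $x=\frac12$):

```python
import numpy as np

# Standard logistic function of x alone (w^T x + b collapsed to x).
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

x, h = 0.5, 1.0 / 200

# Second-order central difference approximation of sigma''(x)
central = (sigma(x + h) - 2 * sigma(x) + sigma(x - h)) / h**2

# Closed form from this answer: sigma'' = sigma (1 - sigma)(1 - 2 sigma)
closed = sigma(x) * (1 - sigma(x)) * (1 - 2 * sigma(x))

print(central)  # approx -0.0575566
print(closed)   # approx -0.0575568
```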

At least, we do not agree.