Logit Gradient/Hessian derivations


I'm trying to follow the algebra leading from the gradient function to the Hessian in Logistic Regression, but I can't quite understand where I have gone wrong.

I have the gradient function as: $$ \sum_i^m \left[x_i\cdot \frac {\exp\{-\theta^T x_i\}} {1+\exp\{-\theta^T x_i\}} - (y_i-1) x_i \right] $$

After deriving the gradient, I'm trying to rearrange terms and simplify the expression...

First, we can note that the logistic regression model can be expressed as $$ p(y=1\mid x_i; \theta) = \frac{1}{1+\exp(-\theta^Tx_i)} $$ Similarly, we can also show that $$ p(y=0\mid x_i; \theta) = 1-\frac{1}{1+\exp(-\theta^Tx_i)} = \frac{\exp(-\theta^Tx_i)}{1+\exp(-\theta^Tx_i)} $$

These expressions rely on the definition of the logistic/sigmoid function $\sigma(\theta^Tx_i) = \frac {1} {1+\exp(-\theta^Tx_i)}$ and let us state more concisely $\frac {\exp(-\theta^Tx_i)} {1+\exp(-\theta^Tx_i)} = 1-\sigma(\theta^Tx_i)$
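These two identities are easy to verify numerically; a small numpy sketch (the variable names are mine, not part of the question):

```python
import numpy as np

def sigmoid(z):
    """Logistic/sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)

# p(y=1 | x; theta): 1 / (1 + exp(-z)) is sigma(z) itself
assert np.allclose(1.0 / (1.0 + np.exp(-z)), sigmoid(z))

# p(y=0 | x; theta): exp(-z) / (1 + exp(-z)) equals 1 - sigma(z)
assert np.allclose(np.exp(-z) / (1.0 + np.exp(-z)), 1.0 - sigmoid(z))
```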

So the problem we are trying to solve goes from: $$ \sum_i^m \left[x_i\cdot \frac {\exp\{-\theta^T x_i\}} {1+\exp\{-\theta^T x_i\}} - (y_i-1) x_i \right] $$

... and becomes $$ \sum_i^m \left[x_i\cdot (1-\sigma(\theta^T x_i)) - (y_i-1) x_i \right] $$ Factoring out the vector $x_i$: $$ \sum_i^m \left[ \left((1-\sigma(\theta^T x_i)) - (y_i-1)\right) x_i \right] $$

Resolving terms: $$ \sum_i^m \left[ \left(1-\sigma(\theta^T x_i) - y_i+1\right) x_i \right] $$ $$ \sum_i^m \left[ \left(2-\sigma(\theta^T x_i) - y_i\right) x_i \right] $$

So, I have this 2 floating around. Did I do something wrong, or is this just a term that vanishes when the Hessian is calculated? I can't find any examples that clarify this for me.
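One way to sanity-check an expression like this is to compare it against a finite-difference gradient of the cross-entropy loss. Here is a minimal numpy sketch (the data and names are made up) checking the standard gradient $\sum_i (\sigma(\theta^Tx_i)-y_i)\,x_i$ against central differences; an expression carrying a stray additive constant would fail this test:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))                    # design matrix, rows are x_i
y = rng.integers(0, 2, size=m).astype(float)   # binary labels
theta = rng.normal(size=n)

def nll(theta):
    """Negative log-likelihood (cross-entropy) of logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# analytic gradient: sum_i (sigma(theta^T x_i) - y_i) x_i
p = 1.0 / (1.0 + np.exp(-(X @ theta)))
g_analytic = X.T @ (p - y)

# central finite differences, one coordinate at a time
eps = 1e-6
g_fd = np.array([
    (nll(theta + eps * e) - nll(theta - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

assert np.allclose(g_analytic, g_fd, atol=1e-4)
```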

Two answers follow.

BEST ANSWER

Define a few variables $$\eqalign{ \def\c#1{\color{red}{#1}} \def\t{\theta} \def\p{\partial} \def\l{\lambda} \def\qiq{\quad\implies\quad} \def\o{{\tt1}} z &= X\t &&\qiq &dz = X\,d\t \\ e &= \exp(z), \;&E = {\rm Diag}(e) &\qiq &de = e\odot dz= E\,dz \\ p &= \frac{e}{\o+e}, &P = {\rm Diag}(p) &\qiq &dp = (P-P^2)\,dz \\ &&&\qiq&\;\,p=P\o \\ }$$

Write the cross-entropy $(\l)$ in terms of these variables, then calculate its gradient $$\eqalign{ -\l &= y:\log(p) \;+\; (\o-y):\log(\o-p) \\ -d\l &= y:P^{-1}\,\c{dp} \;+\; (\o-y):(I-P)^{-1}\,(-dp) \\ &= y:P^{-1}\,\c{(P-P^2)\,dz} \;+\; (y-\o):(I-P)^{-1}(P-P^2)\,dz \\ &= y:(I-P)\,dz \;+\; (y-\o):P\,dz \\ &= (y-Py):\c{dz} \;+\; (Py-\c{P\o}):dz \\ -d\l &= (y-\c{p}):\c{X\,d\t} \\ d\l &= X^T(p-y):d\t \\ g &=\frac{\p\l}{\p\t} = X^T(p-y) \\ }$$

The Hessian is much easier to calculate, since everything except $p$ is a constant $$\eqalign{ &dg = X^T\c{dp} \;=\; X^T(P-P^2)\,X\,d\t \\ &H\;=\;\frac{\p g}{\p\t} \;=\; X^T(P-P^2)\,X \\ }$$


In the above, all functions are applied elementwise and a colon denotes the matrix inner product $$ A:B \;=\; {\rm trace}(A^TB) \;=\; B:A $$

Here's a quick derivation of the differential of the logistic function (using elementwise multiplication) $$\eqalign{ dp &= \frac{(\o+e)\odot\c{de} - e\odot de}{(\o+e)\odot(\o+e)} \\ &= \frac{(\o+e)\odot\c{e\odot dz} - e\odot e\odot dz}{(\o+e)\odot(\o+e)} \\ &= \Big[p\odot dz - p\odot p\odot dz\Big] \\ &= \left(P-P^2\right)dz }$$
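For what it's worth, the final expressions $g = X^T(p-y)$ and $H = X^T(P-P^2)X$ translate directly into numpy. A sketch on synthetic data (names and sizes are my own); note that $P-P^2$ is realized as the diagonal weight vector $p_i(1-p_i)$ rather than an $m\times m$ matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 4
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m).astype(float)
theta = rng.normal(size=n)

# p = e / (1 + e) with e = exp(X theta), i.e. the elementwise sigmoid
p = 1.0 / (1.0 + np.exp(-(X @ theta)))

# gradient g = X^T (p - y)
g = X.T @ (p - y)

# Hessian H = X^T (P - P^2) X, with P = Diag(p);
# P - P^2 is just the diagonal of weights p * (1 - p)
W = p * (1 - p)
H = X.T @ (W[:, None] * X)

# H should be symmetric and positive semi-definite
assert np.allclose(H, H.T)
assert np.all(np.linalg.eigvalsh(H) >= -1e-8)
```

The weighted form `X.T @ (W[:, None] * X)` avoids materializing the $m\times m$ diagonal matrix, which matters when $m$ is large.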

SECOND ANSWER

I assume this is a binary logistic regression (a good reference on this is C. M. Bishop, "Pattern Recognition and Machine Learning", Chapter 4, pp. 205-206). We are given some feature vectors which we denote as $x_{ij}$ ($i$ indexes vectors, $j$ indexes features), and we are trying to estimate the parameters $\theta_j$. The probability of feature vector $i$ being part of class 1 is $$ p_i = \frac{\exp\left(\sum_j \theta_j x_{ij}\right)} {1+ \exp\left(\sum_j \theta_j x_{ij}\right)} $$

If we use the cross-entropy error function, \begin{equation} E = - \sum_i \left( t_i \ln p_i + (1- t_i) \ln (1-p_i) \right) \label{CE} \tag{CE} \end{equation} where $t_i \in \{0,1\}$, then since $$ \ln p_i = \sum_j \theta_j x_{ij} - \ln \left( 1+ \exp\left(\sum_j \theta_j x_{ij}\right) \right) $$ and $$ \ln (1-p_i) = - \ln \left( 1+ \exp\left(\sum_j \theta_j x_{ij}\right) \right) $$ we can rewrite (\ref{CE}) as $$ E = -\sum_i \left( t_i\sum_j \theta_j x_{ij} - \ln \left( 1+ \exp\left(\sum_j \theta_j x_{ij}\right) \right) \right) $$

It follows that $$ \frac{\partial E}{\partial \theta_j} = \sum_i (p_i-t_i) x_{ij} $$ given that (writing the inner sum over a dummy index $l$ to avoid a clash with the differentiation index $j$) \begin{align} \frac{\partial}{\partial \theta_j} \ln \left( 1+ \exp\left(\sum_l \theta_l x_{il}\right) \right) & = \frac{1}{1+ \exp\left(\sum_l \theta_l x_{il}\right)} \frac{\partial}{\partial \theta_j} \left( 1+ \exp\left(\sum_l \theta_l x_{il}\right) \right) \\ & = \frac{\exp\left(\sum_l \theta_l x_{il}\right)}{1+ \exp\left(\sum_l \theta_l x_{il}\right)} x_{ij} \\ & =p_i x_{ij} \end{align}

Since $t_i$ and $x_{ij}$ are not functions of $\theta$, $$ \frac{\partial^2 E}{\partial \theta_j \partial \theta_k} =\sum_i x_{ij} \frac{\partial p_i}{\partial \theta_k} $$ Using $$ \frac{d}{dx} \frac{e^x}{1+e^x}= \frac{e^x}{1+e^x} \frac{1}{1+e^x} $$ we have $$ \frac{\partial^2 E}{\partial \theta_j \partial \theta_k} =\sum_i x_{ij} x_{ik} p_i (1-p_i) $$
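The index form above maps one-to-one onto explicit loops. A quick numpy sketch on synthetic data (my own illustration, not from Bishop) checking the looped formulas for $\partial E/\partial\theta_j$ and $\partial^2 E/\partial\theta_j\partial\theta_k$ against their vectorized equivalents:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 3
x = rng.normal(size=(m, n))                    # x[i, j]: sample i, feature j
t = rng.integers(0, 2, size=m).astype(float)   # targets t_i in {0, 1}
theta = rng.normal(size=n)

# p_i = exp(sum_j theta_j x_ij) / (1 + exp(sum_j theta_j x_ij))
z = x @ theta
p = np.exp(z) / (1.0 + np.exp(z))

# dE/dtheta_j = sum_i (p_i - t_i) x_ij, written with an explicit loop over j
grad = np.array([np.sum((p - t) * x[:, j]) for j in range(n)])
assert np.allclose(grad, x.T @ (p - t))

# d2E/dtheta_j dtheta_k = sum_i x_ij x_ik p_i (1 - p_i)
hess = np.array([[np.sum(x[:, j] * x[:, k] * p * (1 - p))
                  for k in range(n)] for j in range(n)])
assert np.allclose(hess, x.T @ ((p * (1 - p))[:, None] * x))
```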