I'm trying to follow the algebra leading from the gradient function to the Hessian in Logistic Regression, but I can't quite understand where I have gone wrong.
I have the gradient function as: $$ \sum_i^m \left[x_i\cdot \frac {\exp\{-\theta^T x_i\}} {1+\exp\{-\theta^T x_i\}} - (y_i-1) x_i \right] $$
After deriving the gradient, I'm trying to rearrange terms and simplify the expression...
First, we can note that the logistic regression model can be expressed as $$ p(y_i=1\mid x_i; \theta) = \frac{1}{1+\exp(-\theta^Tx_i)} $$ Similarly, we can also show that $$ p(y_i=0\mid x_i; \theta) = 1-\frac{1}{1+\exp(-\theta^Tx_i)} = \frac{\exp(-\theta^Tx_i)}{1+\exp(-\theta^Tx_i)} $$
These expressions rely on the definition of the logistic/sigmoid function $\sigma(\theta^Tx_i) = \frac {1} {1+\exp(-\theta^Tx_i)}$, which lets us state more concisely that $\frac {\exp(-\theta^Tx_i)} {1+\exp(-\theta^Tx_i)} = 1-\sigma(\theta^Tx_i)$
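As a quick numerical sanity check, the identity $\frac{\exp(-z)}{1+\exp(-z)} = 1-\sigma(z)$ can be verified at a few points (a minimal sketch; the helper name `sigma` is mine, not from the post):

```python
import math

def sigma(z):
    """Logistic/sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# check exp(-z)/(1 + exp(-z)) == 1 - sigma(z) at several points
for z in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    lhs = math.exp(-z) / (1.0 + math.exp(-z))
    rhs = 1.0 - sigma(z)
    assert abs(lhs - rhs) < 1e-12
```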
So the problem we are trying to solve goes from: $$ \sum_i^m \left[x_i\cdot \frac {\exp\{-\theta^T x_i\}} {1+\exp\{-\theta^T x_i\}} - (y_i-1) x_i \right] $$
... and becomes: $$ \sum_i^m \left[x_i\cdot (1-\sigma(\theta^T x_i)) - (y_i-1) x_i \right] $$ Factor out the vector $x_i$: $$ \sum_i^m \left[ \left((1-\sigma(\theta^T x_i)) - (y_i-1)\right) x_i \right] $$
Resolve terms $$ \sum_i^m \left[ \left(1-\sigma(\theta^T x_i) - y_i+1\right) x_i \right] $$ $$ \sum_i^m \left[ \left(2 - \sigma(\theta^T x_i) - y_i\right) x_i \right] $$
So I have this 2 floating around. Did I do something wrong, or is this just a term that vanishes when the Hessian is calculated? I can't find any examples that clarify this for me.
Define a few variables
$$\eqalign{ \def\c#1{\color{red}{#1}} \def\t{\theta} \def\p{\partial} \def\l{\lambda} \def\qiq{\quad\implies\quad} \def\o{{\tt1}} z &= X\t &&\qiq &dz = X\,d\t \\ e &= \exp(z), \;&E = {\rm Diag}(e) &\qiq &de = e\odot dz= E\,dz \\ p &= \frac{e}{\o+e}, &P = {\rm Diag}(p) &\qiq &dp = (P-P^2)\,dz \\ &&&\qiq&\;\,p=P\o \\ }$$

Write the cross-entropy $(\l)$ in terms of these variables, then calculate its gradient
$$\eqalign{ -\l &= y:\log(p) \;+\; (\o-y):\log(\o-p) \\ -d\l &= y:P^{-1}\,\c{dp} \;+\; (\o-y):(I-P)^{-1}\,(-dp) \\ &= y:P^{-1}\,\c{(P-P^2)\,dz} \;+\; (y-\o):(I-P)^{-1}(P-P^2)\,dz \\ &= y:(I-P)\,dz \;+\; (y-\o):P\,dz \\ &= (y-Py):\c{dz} \;+\; (Py-\c{P\o}):dz \\ -d\l &= (y-\c{p}):\c{X\,d\t} \\ d\l &= X^T(p-y):d\t \\ g &=\frac{\p\l}{\p\t} = X^T(p-y) \\ }$$

The Hessian is much easier to calculate, since everything except $p$ is a constant
$$\eqalign{ &dg = X^T\c{dp} \;=\; X^T(P-P^2)\,X\,d\t \\ &H\;=\;\frac{\p g}{\p\t} \;=\; X^T(P-P^2)\,X \\ }$$
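As a sanity check on the two closed forms, $g = X^T(p-y)$ and $H = X^T(P-P^2)X$ can be compared against central finite differences of the cross-entropy. A NumPy sketch under assumed random data (all variable names are mine, mirroring the symbols above):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
m, n = 40, 3
X = rng.normal(size=(m, n))                  # design matrix
y = rng.integers(0, 2, size=m).astype(float) # binary labels
theta = rng.normal(size=n)

def nll(t):
    """Cross-entropy lambda = -[ y:log(p) + (1-y):log(1-p) ]."""
    p = 1.0 / (1.0 + np.exp(-X @ t))
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def grad(t):
    """Closed-form gradient g = X^T (p - y)."""
    p = 1.0 / (1.0 + np.exp(-X @ t))
    return X.T @ (p - y)

p = 1.0 / (1.0 + np.exp(-X @ theta))
g = grad(theta)
H = X.T @ np.diag(p - p**2) @ X              # H = X^T (P - P^2) X

# central finite differences of the loss (for g) and of the gradient (for H)
eps = 1e-6
I = np.eye(n)
g_fd = np.array([(nll(theta + eps*I[i]) - nll(theta - eps*I[i])) / (2*eps)
                 for i in range(n)])
H_fd = np.column_stack([(grad(theta + eps*I[i]) - grad(theta - eps*I[i])) / (2*eps)
                        for i in range(n)])

assert np.allclose(g, g_fd, atol=1e-5)
assert np.allclose(H, H_fd, atol=1e-5)
```

Note that $P-P^2$ has the nonnegative entries $p_i(1-p_i)$ on its diagonal, so $H$ is positive semidefinite and the cross-entropy is convex in $\theta$.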
In the above, all functions are applied elementwise and a colon denotes the matrix inner product $$ A:B \;=\; {\rm trace}(A^TB) \;=\; B:A $$
Here's a quick derivation of the differential of the logistic function (using elementwise multiplication and division) $$\eqalign{ dp &= \frac{(\o+e)\odot\c{de} - e\odot de}{(\o+e)\odot(\o+e)} \\ &= \frac{(\o+e)\odot\c{e\odot dz} - e\odot e\odot dz}{(\o+e)\odot(\o+e)} \\ &= \Big[p\odot dz - p\odot p\odot dz\Big] \\ &= \left(P-P^2\right)dz }$$
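Componentwise, this says $dp_i/dz_i = p_i(1-p_i)$, which is easy to confirm with a finite difference (a minimal sketch; the helper name `sigma` is mine):

```python
import math

def sigma(z):
    """Logistic function p = e / (1 + e) with e = exp(z)."""
    return 1.0 / (1.0 + math.exp(-z))

# check dp/dz = p*(1 - p) at several points via a central difference
eps = 1e-6
for z in [-3.0, 0.0, 0.7, 2.5]:
    p = sigma(z)
    dp_fd = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
    assert abs(dp_fd - p * (1.0 - p)) < 1e-8
```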