Log-likelihood gradient and Hessian


Consider a binary classification problem with data $D = \left\{ (x_i, y_i) \right\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{0,1\}$. Given the following definitions,

$$f(x) = x^T \beta$$

$$p(x) = \sigma(f(x)) \quad \text{with} \quad\sigma(z) = 1/(1 + e^{-z})$$

$$L(\beta) = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]$$

where $\beta \in \mathbb{R}^d$ is a parameter vector and $p(x)$ is shorthand for $p(y = 1 \mid x)$. The task is to compute the gradient $\frac{\partial}{\partial \beta} L(\beta)$. A hint is to use the fact that $$\frac{\partial}{\partial z} \sigma(z) = \sigma(z) (1 - \sigma(z))$$
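That identity for $\sigma'$ is easy to confirm numerically. Here is a small sketch in Python/NumPy (the sample points and step size are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Compare a central finite difference of sigma against the
# closed form sigma(z) * (1 - sigma(z)) at a few points.
z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))  # tiny (finite-difference error)
```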


So here is my approach so far:

\begin{align*} L(\beta) & = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]\\ \frac{\partial}{\partial \beta} L(\beta) & = \sum_{i=1}^n \Bigl[ \Bigl( \frac{\partial}{\partial \beta} y_i \log p(x_i) \Bigr) + \Bigl( \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] \Bigr) \Bigr]\\ \end{align*}

\begin{align*} \frac{\partial}{\partial \beta} y_i \log p(x_i) &= (\frac{\partial}{\partial \beta} y_i) \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= 0 \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= y_i \cdot (p(x_i) \cdot (1 - p(x_i))) \end{align*}

\begin{align*} \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] &= (1 - y_i) \cdot (\frac{\partial}{\partial \beta} \log [1 - p(x_i)])\\ & = (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \cdot p(x_i) \cdot (1 - p(x_i))\\ & = (1 - y_i) \cdot p(x_i) \end{align*}

$$\frac{\partial}{\partial \beta} L(\beta) = \sum_{i=1}^n \Bigl[ y_i \cdot (p(x_i) \cdot (1 - p(x_i))) + (1 - y_i) \cdot p(x_i) \Bigr]$$

So basically I used the product and chain rule to compute the derivative. I am afraid that my solution is wrong, because on page 120 of The Elements of Statistical Learning it says the gradient is

$$\sum_{i = 1}^N x_i(y_i - p(x_i;\beta))$$

I don't know what could have possibly gone wrong. Any advice on this?


Best answer:

So, if $p(x)=\sigma(f(x))$ and $\frac{d}{dz}\sigma(z)=\sigma(z)(1-\sigma(z))$, then

$$\frac{d}{dz}p(z) = p(z)(1-p(z)) f'(z) \; .$$

This changes everything, and you should arrive at the correct result this time.

In particular,

$$\frac{d}{dz}\log p(z) = (1-p(z)) f'(z)$$

and

$$\frac{d}{dz}\log (1-p(z)) = -p(z) f'(z) \; .$$
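Putting these together, a quick way to convince yourself that the result matches the gradient from ESL, $\sum_i x_i (y_i - p(x_i))$, is to compare that formula against finite differences of $L(\beta)$ on random data. A sketch in Python/NumPy (the synthetic data and step size are my own choices, purely for the check):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    # sum_i x_i * (y_i - p(x_i)), vectorized as X^T (y - p)
    return X.T @ (y - sigmoid(X @ beta))

# Small synthetic problem (made-up data, just for the sanity check).
n, d = 50, 3
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
beta = rng.normal(size=d)

# Central finite differences of L(beta) in each coordinate.
h = 1e-6
fd = np.array([
    (log_likelihood(beta + h * e, X, y) - log_likelihood(beta - h * e, X, y)) / (2 * h)
    for e in np.eye(d)
])
print(np.max(np.abs(fd - gradient(beta, X, y))))  # should be tiny
```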

Second answer:

In your third line, while differentiating, you left out the factor $1/p(x_i)$, which comes from the derivative of $\log p(x_i)$:

\begin{align*} \frac{d}{d\beta}\bigl(y_i \log p(x_i)\bigr) &= \log p(x_i) \cdot 0 + y_i \cdot \frac{d}{d\beta} \log p(x_i)\\ &= y_i \cdot \frac{1}{p(x_i)} \cdot \frac{d}{d\beta} p(x_i) \end{align*}

Note that $\frac{d}{d\beta} p(x_i) = p(x_i) \cdot (1 - p(x_i)) \cdot {\bf x_i}$, and not just $p(x_i) \cdot (1 - p(x_i))$.

Also, in your seventh line you missed the $-$ sign that comes from the derivative of $(1 - p(x_i))$.
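Since the title also asks for the Hessian: differentiating the gradient $\sum_i x_i (y_i - p(x_i))$ once more with respect to $\beta$ gives $H = -\sum_i p(x_i)(1 - p(x_i))\, x_i x_i^T = -X^T W X$, where $W = \operatorname{diag}\bigl(p(x_i)(1 - p(x_i))\bigr)$ (the $y_i$ terms drop out). This too can be sanity-checked against finite differences of the gradient; the sketch below uses made-up data of my own:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(beta, X, y):
    return X.T @ (y - sigmoid(X @ beta))

def hessian(beta, X):
    # H = -X^T W X with W = diag(p_i * (1 - p_i)); y does not appear.
    p = sigmoid(X @ beta)
    return -(X.T * (p * (1 - p))) @ X

# Small synthetic problem, just for the check.
n, d = 40, 3
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
beta = rng.normal(size=d)

# Central finite differences of the gradient, column by column.
h = 1e-6
fd = np.column_stack([
    (gradient(beta + h * e, X, y) - gradient(beta - h * e, X, y)) / (2 * h)
    for e in np.eye(d)
])
print(np.max(np.abs(fd - hessian(beta, X))))  # should be tiny
```

Note that $H$ is negative semi-definite, which is why Newton's method (iteratively reweighted least squares) works so well for this log-likelihood.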