Consider a binary classification problem with data $D = \left\{ (x_i, y_i) \right\}_{i=1}^n$, $x_i \in \mathbb{R}^d$ and $y_i \in \{0,1\}$. Given the following definitions,
$$f(x) = x^T \beta$$
$$p(x) = \sigma(f(x)) \quad \text{with} \quad\sigma(z) = 1/(1 + e^{-z})$$
$$L(\beta) = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]$$
where $\beta \in \mathbb{R}^d$ is a vector, and $p(x)$ is shorthand for $p(y = 1 \mid x)$. The task is to compute the derivative $\frac{\partial}{\partial \beta} L(\beta)$. A hint is to use the fact that $$\frac{\partial}{\partial z} \sigma(z) = \sigma(z) (1 - \sigma(z))$$
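For concreteness, the definitions above can be evaluated numerically. This is just an illustrative sketch with made-up toy data (the matrix `X` stacks the $x_i^T$ as rows):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

# toy data: n = 4 points in d = 2 dimensions (values chosen arbitrarily)
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
beta = np.array([0.2, -0.1])

f = X @ beta            # f(x_i) = x_i^T beta, one entry per data point
p = sigmoid(f)          # p(x_i) = sigma(f(x_i)), each in (0, 1)
# log-likelihood L(beta), a sum of log-probabilities (hence <= 0)
L = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```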
So here is my approach so far:
\begin{align*} L(\beta) & = \sum_{i=1}^n \Bigl[ y_i \log p(x_i) + (1 - y_i) \log [1 - p(x_i)] \Bigr]\\ \frac{\partial}{\partial \beta} L(\beta) & = \sum_{i=1}^n \Bigl[ \Bigl( \frac{\partial}{\partial \beta} y_i \log p(x_i) \Bigr) + \Bigl( \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] \Bigr) \Bigr]\\ \end{align*}
\begin{align*} \frac{\partial}{\partial \beta} y_i \log p(x_i) &= (\frac{\partial}{\partial \beta} y_i) \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= 0 \cdot \log p(x_i) + y_i \cdot (\frac{\partial}{\partial \beta} p(x_i))\\ &= y_i \cdot (p(x_i) \cdot (1 - p(x_i))) \end{align*}
\begin{align*} \frac{\partial}{\partial \beta} (1 - y_i) \log [1 - p(x_i)] &= (1 - y_i) \cdot (\frac{\partial}{\partial \beta} \log [1 - p(x_i)])\\ & = (1 - y_i) \cdot \frac{1}{1 - p(x_i)} \cdot p(x_i) \cdot (1 - p(x_i))\\ & = (1 - y_i) \cdot p(x_i) \end{align*}
$$\frac{\partial}{\partial \beta} L(\beta) = \sum_{i=1}^n \Bigl[ y_i \cdot (p(x_i) \cdot (1 - p(x_i))) + (1 - y_i) \cdot p(x_i) \Bigr]$$
So basically I used the product and chain rule to compute the derivative. I am afraid that my solution is wrong, because on page 120 of The Elements of Statistical Learning it says the gradient is
$$\sum_{i = 1}^N x_i(y_i - p(x_i;\beta))$$
I don't know what could have possibly gone wrong. Any advice on this?
So, if $p(x)=\sigma(f(x))$ and $\frac{d}{dz}\sigma(z)=\sigma(z)(1-\sigma(z))$, then
$$\frac{d}{dz}p(z) = p(z)(1-p(z)) f'(z) \; .$$
This changes everything, and you should arrive at the correct result this time.
In particular,
$$\frac{d}{dz}\log p(z) = (1-p(z)) f'(z)$$
and
$$\frac{d}{dz}\log (1-p(z)) = -p(z) f'(z) \; .$$
Since $f(x) = x^T \beta$ is linear in $\beta$, the derivative of $f$ with respect to $\beta$ contributes a factor of $x_i$ to each term, and the two pieces combine to
$$\sum_{i=1}^n \Bigl[ y_i (1 - p(x_i)) - (1 - y_i)\, p(x_i) \Bigr] x_i = \sum_{i=1}^n x_i \bigl( y_i - p(x_i) \bigr) \; ,$$
which is exactly the expression from The Elements of Statistical Learning.
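As a sanity check, the gradient $\sum_i x_i (y_i - p(x_i))$ can be compared against a central finite-difference approximation of $L(\beta)$. A minimal sketch, using random data purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]"""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    """Closed form: sum_i x_i (y_i - p(x_i)), written as X^T (y - p)."""
    p = sigmoid(X @ beta)
    return X.T @ (y - p)

# random toy problem (arbitrary sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
beta = rng.normal(size=3)

g = gradient(beta, X, y)

# central finite differences along each coordinate direction e
eps = 1e-6
g_num = np.array([
    (log_likelihood(beta + eps * e, X, y)
     - log_likelihood(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
```

If the closed form is right, `g` and `g_num` should agree to within the finite-difference error.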