I am trying to derive the gradient of the loss function of a logistic regression model.
Instead of 0 and 1, the labels (denoted $t_n$ below) can only take the values 1 or -1, so the loss function is slightly different.
Here is my derivation. The answers I found online all differ slightly from mine, so I'd be grateful if anyone could check whether I did something wrong!
\begin{align*} L &= -\sum_{n=1}^{N} \log \sigma\left( t_n (w^\top x_n + w_0) \right)\\ \frac{dL}{dw} &=-\frac{d}{dw}\sum_{n=1}^{N} \log \sigma\left( t_n (w^\top x_n + w_0) \right)\\ &=-\sum_{n=1}^{N} \frac{d}{dw}\log \sigma\left( t_n (w^\top x_n + w_0) \right)\\ &=-\frac{d}{dw}\log \sigma\left( w'^\top X'^\top T \right) \end{align*} where $w' = \begin{bmatrix} w \\ w_0 \end{bmatrix}$, $X'= \begin{bmatrix} x_1^\top & 1\\ \vdots & \vdots\\ x_N^\top & 1\\ \end{bmatrix}$, and $T = \begin{bmatrix} t_1 & \cdots & t_N \end{bmatrix}^\top$ is the vector of labels.
Now let $A(x)=\log(x)$, $B(x)=\sigma(x)$, and $C = w'^\top X'^\top T$. Then, \begin{align*} \frac{dL}{dw'}&=\frac{dA(B)}{dB} \times \frac{dB(C)}{dC} \times \frac{dC}{dw'}\\ &=\frac{1}{B} \times \sigma(C)(1-\sigma(C)) \times \frac{dC}{dw'}\\ &=(1-\sigma(C)) \times X'^\top T\\ &=(1-\sigma(w'^\top X'^\top T)) \times X'^\top T \end{align*}
$ \def\b{\omega_0}\def\s{\sigma}\def\o{{\tt1}}\def\p{\partial} \def\L{{\cal L}} \def\LR#1{\left(#1\right)} \def\diag#1{\operatorname{diag}\LR{#1}} \def\Diag#1{\operatorname{Diag}\LR{#1}} \def\trace#1{\operatorname{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} $Let $\o$ denote the all-ones vector and $T$ a diagonal matrix constructed from the $\{t_n\}$ values and define the vector $$\eqalign{ p &= T\LR{Xw+\o\,\b} \qiq dp = TXdw \\ }$$ and the elementwise logistic function $$\eqalign{ s &= \s(p) = \frac{e^p}{\o+e^p} \\ S &= \Diag{s} &\qiq ds=\LR{S-S^2}dp \\ }$$ Use the above notation to rewrite the loss function and calculate its gradient. $$\eqalign{ L &= -\o:\log(s) \\ dL &= -\o:S^{-1}ds \\ &= -\o:S^{-1}\LR{S-S^2}dp \\ &= \LR{S-I}\o:dp \\ &= \LR{s-\o}:\LR{TXdw} \\ &= X^TT\LR{s-\o}:dw \\ \grad{L}{w} &= X^TT\LR{s-\o} \\ }$$ This almost matches your result, except for the overall sign and the order of the factors.
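The gradient $\grad{L}{w} = X^TT\LR{s-\o}$ is easy to sanity-check against central finite differences. A minimal NumPy sketch (the data, dimensions, and variable names are all illustrative, not part of the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 3                        # sample size and feature dimension (illustrative)
X = rng.normal(size=(N, d))         # no column of ones, matching the convention above
t = rng.choice([-1.0, 1.0], size=N) # labels in {-1, +1}
T = np.diag(t)                      # T = Diag(t)
w = rng.normal(size=d)
b = 0.5                             # the bias w_0, held fixed

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w):
    p = T @ (X @ w + b)             # p = T(Xw + 1 b)
    return -np.sum(np.log(sigma(p)))

# analytic gradient: X^T T (s - 1)
s = sigma(T @ (X @ w + b))
grad = X.T @ T @ (s - 1.0)

# central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.max(np.abs(grad - num)))   # maximum discrepancy; should be tiny
```

With these values the analytic and numerical gradients agree to many decimal places, which supports the sign $\LR{s-\o}$ rather than $\LR{\o-s}$.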
In some of the steps above a colon is used to denote the matrix inner product $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \|A\|^2_F \\ }$$ The properties of the underlying trace function allow the terms in such a product to be rearranged in many different but equivalent ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A \\&= \LR{A^TC}:B \\ }$$ Also note that my definition of the $X$ matrix omits the rightmost column of ${\large\tt1}s\:$ since $\,\b$ drops out of the derivative anyway.
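The rearrangement rules for the colon product can also be verified numerically. A small sketch with random matrices (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
B = rng.normal(size=(3, 5))
C = rng.normal(size=(4, 5))

def frob(P, Q):
    """Matrix inner product P:Q = sum_ij P_ij Q_ij."""
    return np.sum(P * Q)

# definition via the trace: P:Q = Tr(P^T Q)
assert np.isclose(frob(C, A @ B), np.trace(C.T @ (A @ B)))
# rearrangement rules
assert np.isclose(frob(C, A @ B), frob(C @ B.T, A))          # C:(AB) = (CB^T):A
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B))          # C:(AB) = (A^T C):B
# norm identity: A:A = ||A||_F^2
assert np.isclose(frob(C, C), np.linalg.norm(C, 'fro')**2)
print("all identities hold")
```

Each identity reduces to the cyclic property of the trace, $\trace{PQ}=\trace{QP}$, applied to $\trace{P^TQ}$.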