I want to compute the gradient $\nabla_wJ(\mathbf{w})$ of $J(\mathbf{w}) = -\mathbf{y}\cdot \ln(s(\mathbf{Xw})) - (\mathbf{1} - \mathbf{y})\cdot \ln(\mathbf{1} - s(\mathbf{Xw}))$, where $\mathbf{y}$ is an $n$-vector, $\mathbf{X}$ is an $n \times d$ matrix, $\mathbf{w}$ is a $d$-vector, $s_i = s(\mathbf{x_i}\cdot\mathbf{w})$, and $s(a) = \frac{1}{1+e^{-a}}$ is the logistic function. I want the result expressed in matrix-vector form, so a purely componentwise answer isn't what I'm ultimately after. (It's fine to work with individual components along the way, as long as the final result is in matrix-vector form.) The following is my attempt so far:
$$\nabla_wJ(\mathbf{w}) = -\nabla_w[\mathbf{y}\cdot \ln(s(\mathbf{Xw}))] - \nabla_w[(\mathbf{1}-\mathbf{y})\cdot \ln(\mathbf{1} - s(\mathbf{Xw}))] = \\ -\nabla_w[\mathbf{y}]\ln(s(\mathbf{Xw}))-\nabla_w[\ln(s(\mathbf{Xw}))]\mathbf{y} - \nabla_w[(\mathbf{1}-\mathbf{y})]\ln(\mathbf{1}-s(\mathbf{Xw}))-\nabla_w[\ln(\mathbf{1}-s(\mathbf{Xw}))](\mathbf{1}-\mathbf{y}) = \\ -\nabla_w[\ln(s(\mathbf{Xw}))]\mathbf{y}-\nabla_w[\ln(\mathbf{1}-s(\mathbf{Xw}))](\mathbf{1}-\mathbf{y}) = \\ -[\nabla_w[s(\mathbf{Xw})]\nabla_{s(\mathbf{Xw})}\ln(s(\mathbf{Xw}))]\mathbf{y}-[\nabla_w[\mathbf{1}-s(\mathbf{Xw})]\nabla_{\mathbf{1} - s(\mathbf{Xw})}\ln(\mathbf{1}-s(\mathbf{Xw}))](\mathbf{1}-\mathbf{y}) = \\ -[\nabla_w\mathbf{s}\nabla_{\mathbf{s}}\ln(\mathbf{s})]\mathbf{y}-[\nabla_w(\mathbf{1} - \mathbf{s})\nabla_{\mathbf{1}-\mathbf{s}}\ln(\mathbf{1}-\mathbf{s})](\mathbf{1}-\mathbf{y}),$$ where $\mathbf{s} = s(\mathbf{Xw}),$ giving $$ = -[\nabla_w\mathbf{Xw}\nabla_{\mathbf{Xw}}s(\mathbf{Xw})\nabla_{\mathbf{s}}\ln(\mathbf{s})]\mathbf{y} - [-\nabla_w\mathbf{Xw}\nabla_{\mathbf{Xw}}s(\mathbf{Xw})\nabla_{\mathbf{1}-\mathbf{s}}\ln(\mathbf{1}-\mathbf{s})](\mathbf{1}-\mathbf{y}).$$ Now, if we put $\mathbf{u} = \mathbf{Xw}$, we have $$ = -[\nabla_w\mathbf{u}\nabla_us(\mathbf{u})\nabla_{\mathbf{s}}\ln(\mathbf{s})]\mathbf{y} + [\nabla_w\mathbf{u}\nabla_us(\mathbf{u})\nabla_{\mathbf{1}-\mathbf{s}}\ln(\mathbf{1}-\mathbf{s})](\mathbf{1}-\mathbf{y}). $$ This is as far as I can get; I don't see how to evaluate the remaining gradients and simplify. Any help would be greatly appreciated. This matrix calculus stuff seems a bit strange to me.
Define some auxiliary variables. $$\eqalign{ a &= Xw &\implies A = \operatorname{Diag}(a) \\ s &= \big({\tt1}+\exp(-a)\big)^{-1} &\implies S = \operatorname{Diag}(s) \\ ds &= (I-S)S\,da \\&= (I-S)SX\,dw \\ }$$ Write the cost function in terms of these variables.
Then calculate its differential and gradient. $$\eqalign{ {\cal J} &= (y-{\tt1}):\log({\tt1}-s) - y:\log(s) \\ d{\cal J} &= (y-{\tt1}):d\log({\tt1}-s) - y:d\log(s) \\ &= ({\tt1}-y):(I-S)^{-1}ds - y:S^{-1}ds \\ &= \Big((I-S)^{-1}({\tt1}-y) - S^{-1}y\Big):ds \\ &= \Big((I-S)^{-1}({\tt1}-y) - S^{-1}y\Big):\big((I-S)SX\,dw\big) \\ &= \Big(S(I-S)(I-S)^{-1}({\tt1}-y) - (I-S)SS^{-1}y\Big):X\,dw \\ &= \big(S({\tt1}-y) - (I-S)y\big):X\,dw \\ &= \big(s-Sy-y+Sy\big):X\,dw \\ &= X^T(s-y):dw \\ \frac{\partial{\cal J}}{\partial w} &= X^T(s-y) \\ }$$
In the above, the logistic, $\exp$, and $\log$ functions are applied elementwise, $I$ is the identity matrix,
${\tt1}$ is the all-ones vector, and the colon represents the trace/Frobenius product, i.e. $$\eqalign{ A:B &= \operatorname{Tr}(A^TB) }$$ The cyclic property of the trace allows the terms in such products to be rearranged in a variety of equivalent ways, e.g. $$\eqalign{ A:B &= B:A \;=\; B^T:A^T \\ CA:B &= A:C^TB \;=\; C:BA^T \;=\; \ldots \\ }$$ The derivative of the logistic function in scalar form is $$\frac{ds}{da} = (1-s)s \quad\implies ds = (1-s)s\,da$$ The elementwise vector form is $$ds = ({\tt1}-s)\odot s\odot da$$ where $\odot$ is the elementwise/Hadamard product.
Hadamard products of vectors can be replaced by ordinary matrix products with the corresponding diagonal matrices, e.g. $s\odot da = S\,da$, which yields the simple form $ds = (I-S)S\,da$ at the top of this post.
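The product rules and the diagonal-matrix replacement can be spot-checked numerically. The following is only a rough sketch in Python with NumPy; the helper name `frob`, the array sizes, and the random test data are assumptions made for illustration and are not part of the derivation.

```python
import numpy as np

def frob(A, B):
    # Frobenius product A:B = Tr(A^T B).
    return np.trace(A.T @ B)

def logistic(a):
    # Elementwise logistic function 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 3))

# A:B = B:A = B^T:A^T
ok_sym = bool(np.isclose(frob(A, B), frob(B, A))
              and np.isclose(frob(A, B), frob(B.T, A.T)))

# CA:B = A:C^T B = C:B A^T
ok_cyc = bool(np.isclose(frob(C @ A, B), frob(A, C.T @ B))
              and np.isclose(frob(C @ A, B), frob(C, B @ A.T)))

# Hadamard form (1 - s) * s * da versus diagonal-matrix form (I - S) S da.
a = rng.normal(size=5)
da = 1e-6 * rng.normal(size=5)  # small perturbation of a
s = logistic(a)
S = np.diag(s)
I = np.eye(5)
ok_diag = bool(np.allclose((1.0 - s) * s * da, (I - S) @ S @ da))

# First-order check: s(a + da) - s(a) should match (I - S) S da
# up to terms of order |da|^2.
err_ds = float(np.max(np.abs(logistic(a + da) - s - (I - S) @ S @ da)))

print(ok_sym, ok_cyc, ok_diag, err_ds)
```

The three flags should come out true, and `err_ds` should be far smaller than the perturbation itself, since the differential only drops second-order terms.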
The differential of the elementwise log function is derived similarly, with the division below taken elementwise. $$d\log(s) = \frac{ds}{s} \quad\implies ds = s\odot d\!\log(s) = S\,\,d\!\log(s)$$
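As a sanity check on the final result $\frac{\partial{\cal J}}{\partial w} = X^T(s-y)$, one can compare it against a central finite-difference approximation of the cost. This is a minimal sketch in Python with NumPy; the function names, the problem sizes, the step size `h`, and the random data are all assumptions chosen for illustration.

```python
import numpy as np

def logistic(a):
    # Elementwise logistic function 1 / (1 + exp(-a)).
    return 1.0 / (1.0 + np.exp(-a))

def cost(w, X, y):
    # J(w) = -y . log(s) - (1 - y) . log(1 - s), with s = logistic(X w).
    s = logistic(X @ w)
    return -(y @ np.log(s)) - ((1.0 - y) @ np.log(1.0 - s))

def grad(w, X, y):
    # Closed-form gradient from the derivation: X^T (s - y).
    return X.T @ (logistic(X @ w) - y)

def numerical_grad(f, w, h=1e-6):
    # Central finite differences, one coordinate of w at a time.
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h
        g[k] = (f(w + e) - f(w - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n, d = 50, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)  # random 0/1 labels
w = rng.normal(size=d)

analytic = grad(w, X, y)
numeric = numerical_grad(lambda v: cost(v, X, y), w)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print(max_abs_diff)
```

The printed difference should be tiny compared with the gradient entries. Note that the closed form needs only one matrix-vector product with $X$ and one with $X^T$, and the diagonal matrices $S$ and $I-S$ never have to be formed explicitly.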