Given the multinomial logistic loss defined as
$$ L(z; y=j) = -\log[\operatorname{softmax}(z)]_j $$ with $$[\operatorname{softmax}(z)]_j = \frac{\exp(z_j)}{\sum^K_{k=1} \exp(z_k)},$$
How can I compute the gradient and the Hessian of L with respect to z?
Right now I have the following for the gradient of $L$ with respect to $z$ (writing $s_i = [\operatorname{softmax}(z)]_i$):
$$ \frac{\partial L}{\partial z_i} = \begin{cases} s_i & i \neq j \\ s_j - 1 & i = j \end{cases} $$
And so, from my understanding, computing the Hessian will give me the following:
$$ \frac{\partial^2 L}{\partial z_i^2} = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases} $$
Leaving me with the identity matrix.
What I need help with is understanding whether these values are correct.
They don't look correct, to say the least.
Thank you.
Your notation (probably derived from a statistics book) is horrible (as in most statistics books).
In mathematical terms: let $\pi_k:\mathbf{R}^d \to \mathbf{R}$ denote the $k$th projection $\pi_k(z) = z_k.$ You want to find the first and second derivative of $$L(z) = - \log\left( \dfrac{\exp(\pi_j(z))}{\sum\limits_{k = 1}^d \exp(\pi_k(z))} \right).$$ Let me first rewrite $L(z)$ as follows: $$ L(z) = \log \left( \sum\limits_{k = 1}^d \exp(\pi_k(z)) \right) - \pi_j(z). $$

Next, notice that each $\pi_k$ is a linear function, which implies that its derivative at any point is just itself: $\pi_k'(z) = \pi_k.$ Then, by the chain rule, $$ \begin{align} L'(z) &= \log'\left( \sum\limits_{k = 1}^d \exp(\pi_k(z)) \right) \sum\limits_{k = 1}^d \exp'(\pi_k(z)) \pi_k - \pi_j \\ &=\dfrac{\sum\limits_{k = 1}^d \exp(\pi_k(z)) \pi_k}{\sum\limits_{k = 1}^d \exp(\pi_k(z))} - \pi_j. \end{align} $$

It is convenient to write this in matrix terms. Note that for any $k,$ $\pi_k(z) = z_k = c_k^\intercal z,$ where $c_k$ is the $k$th canonical vector in $\mathbf{R}^d,$ so that $\sum\limits_{k = 1}^d \exp(\pi_k(z)) = \sum\limits_{k = 1}^d e^{c_k^\intercal z}.$ Thus, the matrix representation of $L'(z)$ (relative to the canonical bases of $\mathbf{R}^d$ and $\mathbf{R}$) is $$ L'(z) = \dfrac{\sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k}{\sum\limits_{k = 1}^d e^{c_k^\intercal z}} - c_j. $$

The second derivative of $L$ is now obvious (just differentiate the previous expression with respect to $z,$ using the quotient rule): $$ L''(z) = \dfrac{ \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k c_k^\intercal \right) \sum\limits_{k = 1}^d e^{c_k^\intercal z} - \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k \right) \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} c_k^\intercal \right) }{ \left( \sum\limits_{k = 1}^d e^{c_k^\intercal z} \right)^2 }. $$

For a given $z,$ write $s_k = s_k(z) = e^{c_k^\intercal z} = e^{z_k},$ and denote by $s$ the sum of the $s_k.$ Then $$ L''(z) = \dfrac{\sum\limits_{k = 1}^d ss_k c_k c_k^\intercal - \sum\limits_{1 \leq a, b \leq d} s_as_b c_ac_b^\intercal}{ s^2 }. $$

Note that $c_ac_b^\intercal$ is a square matrix of order $d$ filled with zeroes except for entry $(a, b),$ where there is a one. Thus, $$ [L'(z)]_i = \dfrac{s_i}{s} - \delta_{i,j} $$ and $$ [L''(z)]_{a,b} = \delta_{a,b} \dfrac{s_a}{s} - \dfrac{s_a s_b}{s^2}. $$
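As a numerical sanity check, the closed forms above, $[L'(z)]_i = s_i/s - \delta_{i,j}$ and $[L''(z)]_{a,b} = \delta_{a,b}\, s_a/s - s_a s_b/s^2,$ can be compared against central finite differences. A quick sketch in NumPy (function names are mine, not from any library):

```python
import numpy as np

def loss(z, j):
    # L(z) = log(sum_k exp(z_k)) - z_j
    return np.log(np.exp(z).sum()) - z[j]

def grad(z, j):
    # [L'(z)]_i = s_i / s - delta_{ij}, with s_i = exp(z_i) and s = sum_k s_k
    p = np.exp(z) / np.exp(z).sum()
    g = p.copy()
    g[j] -= 1.0
    return g

def hess(z):
    # [L''(z)]_{ab} = delta_{ab} s_a / s - s_a s_b / s^2 (independent of j)
    p = np.exp(z) / np.exp(z).sum()
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
d, j, eps = 5, 2, 1e-6
z = rng.standard_normal(d)
I = np.eye(d)

# central finite differences of L (for the gradient) and of the gradient (for the Hessian)
g_num = np.array([(loss(z + eps*I[i], j) - loss(z - eps*I[i], j)) / (2*eps)
                  for i in range(d)])
H_num = np.array([(grad(z + eps*I[i], j) - grad(z - eps*I[i], j)) / (2*eps)
                  for i in range(d)]).T

assert np.allclose(g_num, grad(z, j), atol=1e-6)
assert np.allclose(H_num, hess(z), atol=1e-5)
```

In particular, with $p = \operatorname{softmax}(z)$ the Hessian is $\operatorname{diag}(p) - pp^\intercal,$ which is not the identity matrix.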