This is a sequel to a previous question about implementing binary logistic regression from scratch.
Background knowledge:
To train a logistic regression model for a classification problem with $K$ classes, we are given a training dataset consisting of feature vectors $x_1, x_2, \ldots, x_N \in \mathbb R^d$ and corresponding one-hot encoded label vectors $y_1, y_2, \ldots, y_N \in \mathbb R^K$. If example $i$ belongs to class $k$, then the label vector $y_i$ has a $1$ in the $k$th position and zeros elsewhere. Our goal is to find vectors $\beta_1, \beta_2, \ldots, \beta_K \in \mathbb R^{d+1}$ such that $$ \tag{1} y_i \approx S\left(\begin{bmatrix} \hat x_i^T \beta_1 \\ \vdots \\ \hat x_i^T \beta_K \end{bmatrix} \right) \quad \text{for } i = 1, \ldots, N. $$ Here $\hat x_i \in \mathbb R^{d+1}$ is the "augmented" feature vector obtained by prepending a $1$ to the feature vector $x_i \in \mathbb R^d$, and $S:\mathbb R^K \to \mathbb R^K$ is the softmax function defined by $$ S(u) = \begin{bmatrix} \frac{e^{u_1}}{e^{u_1} + \cdots + e^{u_K}} \\ \vdots \\ \frac{e^{u_K}}{e^{u_1} + \cdots + e^{u_K}} \end{bmatrix} \quad \text{for all vectors } u = \begin{bmatrix} u_1 \\ \vdots \\ u_K \end{bmatrix} \in \mathbb R^K. $$ The softmax function is useful in machine learning because it converts a vector into a "probability vector" (that is, a vector whose components are nonnegative and sum to $1$). Equation (1) can be expressed more concisely using vector and matrix notation as follows. Define $$ \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix} \in \mathbb R^{K(d+1)} \quad \text{and} \quad M_i = \begin{bmatrix} \hat x_i^T & & & \\ & \hat x_i^T & & \\ & & \ddots & \\ & & & \hat x_i^T \end{bmatrix} \in \mathbb R^{K \times K(d+1)}. $$ Then equation (1) says that we hope that $$ y_i \approx S(M_i \beta) \quad \text{for } i = 1, \ldots, N. $$
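The two ingredients above are easy to sketch in NumPy (the helper names `softmax` and `augment` are my own; subtracting $\max(u)$ before exponentiating is the standard overflow guard, and it does not change the value of $S(u)$ because the shift cancels in the ratio):

```python
import numpy as np

def softmax(u):
    # S(u): subtract max(u) before exponentiating to avoid overflow;
    # the shift cancels in the ratio, so the value is unchanged.
    z = np.exp(u - np.max(u))
    return z / np.sum(z)

def augment(x):
    # The "augmented" feature vector xHat: prepend a 1 to x.
    return np.concatenate(([1.0], x))
```

As expected, `softmax` always returns a probability vector: nonnegative entries that sum to $1$.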
I think of $S(M_i \beta) \in \mathbb R^K$ as being a "predicted probability vector" which tells you how strongly our model believes that example $i$ belongs to each of the $K$ classes. And $y_i$ is a "ground truth" probability vector which reflects certainty about which of the $K$ classes example $i$ belongs to.
We will use the cross-entropy loss function $$ \ell(p,q) = \sum_{k=1}^K -p_k \log(q_k) $$ to measure how well a predicted probability vector $q \in \mathbb R^K$ agrees with a ground truth probability vector $p \in \mathbb R^K$. Here $\log$ denotes the natural logarithm. The vector $\beta$ will be chosen to minimize the average cross-entropy $$ \tag{2} L(\beta) = \frac{1}{N} \sum_{i=1}^N \ell(y_i, S(M_i \beta)). $$ We can minimize $L(\beta)$ using an optimization algorithm such as gradient descent, stochastic gradient descent, or Newton's method. In order to use such methods, we need to know how to compute the gradient of $L$. For Newton's method, we also need to know how to compute the Hessian of $L$.
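A direct translation of equation (2) into NumPy might look like the following (a minimal sketch; the names `beta_cols`, `X_hat`, and `Y` are my own conventions, with $\beta_k$ stored as the $k$th column of `beta_cols`):

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / np.sum(z)

def average_cross_entropy(beta_cols, X_hat, Y):
    # L(beta) from equation (2).
    # beta_cols: (d+1, K), column k is beta_k.
    # X_hat:     (N, d+1), row i is the augmented feature vector xHat_i.
    # Y:         (N, K),   row i is the one-hot label vector y_i.
    N = X_hat.shape[0]
    total = 0.0
    for i in range(N):
        q = softmax(X_hat[i] @ beta_cols)   # predicted probabilities S(M_i beta)
        total += -np.dot(Y[i], np.log(q))   # cross-entropy ell(y_i, q)
    return total / N
```

A quick sanity check: with $\beta = 0$, every prediction is the uniform vector $(1/K, \ldots, 1/K)$, so the loss is exactly $\log K$.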
Question:
How can we compute the gradient and the Hessian of the average cross-entropy function $L(\beta)$ defined in equation (2)?
The goal is to compute the gradient and the Hessian of $L$ with finesse, making good use of vector and matrix notation.
(I'll post an answer below.)
Notice that $$ L(\beta) = \frac{1}{N} \sum_{i=1}^N h_i(M_i \beta) $$ where $h_i: \mathbb R^K \to \mathbb R$ is the function defined by $$ h_i(u) = \ell(y_i, S(u)). $$ The softmax function and the cross-entropy loss function are natural companions, and they are happy to be combined together into the function $h_i$. By the chain rule, $$ L'(\beta) = \frac{1}{N} \sum_{i=1}^N h_i'(M_i \beta) M_i. $$ If we use the convention that the gradient is a column vector, then we have $$ \nabla L(\beta) = L'(\beta)^T = \frac{1}{N} \sum_{i=1}^N M_i^T \nabla h_i(M_i \beta). $$ So, to compute $\nabla L(\beta)$, we only need to figure out how to evaluate $\nabla h_i(u)$. We are nearly done already.
Before we compute $\nabla h_i(u)$, let's first simplify $h_i(u)$ as much as possible. The components of the label vector $y_i \in \mathbb R^K$ will be denoted $y_{i1}, \ldots, y_{iK}$. Note that $y_{i1} + \cdots + y_{iK} = 1$. Doing some high school algebra and using some properties of logarithms, we find that \begin{align} h_i(u) &= \sum_{k=1}^K - y_{ik} \log\left(\frac{e^{u_k}}{e^{u_1} + \cdots + e^{u_K}}\right) \\ &= \sum_{k=1}^K -y_{ik} u_k + y_{ik} \log(e^{u_1} + \cdots + e^{u_K}) \\ &= -\langle y_i, u \rangle + \sum_{k=1}^K y_{ik} \underbrace{\log(e^{u_1} + \cdots + e^{u_K})}_{\text{common factor}} \\ &= - \langle y_i, u \rangle + \log(e^{u_1} + \cdots + e^{u_K})\underbrace{\sum_{k=1}^K y_{ik}}_1 \\ &= \underbrace{\log(e^{u_1} + \cdots + e^{u_K})}_{\text{LogSumExp function}} - \langle y_i, u \rangle. \end{align} Look how much $h_i(u)$ simplified! And look how beautiful it is that our old friend the LogSumExp function has appeared. This goes to show that the softmax function and the cross-entropy loss function are meant to be together. This is why PyTorch's loss function torch.nn.functional.cross_entropy takes "logits" as input. (In other words, PyTorch's cross_entropy loss function assumes that you have not yet applied the softmax function.)
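The simplification is easy to verify numerically (a small check with made-up numbers; `log_sum_exp` is my own helper, using the standard max-shift trick):

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / np.sum(z)

def log_sum_exp(u):
    # LogSumExp with the usual max-shift for numerical stability.
    m = np.max(u)
    return m + np.log(np.sum(np.exp(u - m)))

rng = np.random.default_rng(0)
u = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                                    # one-hot ground truth vector
h_direct = -np.dot(y, np.log(softmax(u)))     # ell(y, S(u)) computed directly
h_simple = log_sum_exp(u) - np.dot(y, u)      # LogSumExp(u) - <y, u>
```

The two quantities `h_direct` and `h_simple` agree to machine precision.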
Now we are ready to compute the gradient. It can be shown that the gradient of the LogSumExp function is the softmax function, which is a beautiful fact. It follows that $$ \nabla h_i(u) = S(u) - y_i. $$ This is a beautiful and delightfully simple formula. Interpretation: If the predicted probability vector $S(u)$ agrees perfectly with the ground truth probability vector $y_i$, then $\nabla h_i(u)$ is zero, suggesting that no change to the value of $u$ is necessary.
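The formula $\nabla h_i(u) = S(u) - y_i$ can be checked against a centered finite-difference approximation (a sketch; `h` below is the simplified LogSumExp form of $h_i$ derived above):

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / np.sum(z)

def h(u, y):
    # h_i(u) = LogSumExp(u) - <y_i, u>
    m = np.max(u)
    return m + np.log(np.sum(np.exp(u - m))) - np.dot(y, u)

rng = np.random.default_rng(1)
u = rng.normal(size=4)
y = np.zeros(4)
y[1] = 1.0
grad_formula = softmax(u) - y                 # claimed gradient S(u) - y
eps = 1e-6
grad_numeric = np.array([(h(u + eps * e, y) - h(u - eps * e, y)) / (2 * eps)
                         for e in np.eye(4)])
```

The analytic and numerical gradients match to within finite-difference error.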
Putting the above pieces together, we see that $$ \nabla L(\beta) = \frac{1}{N} \sum_{i=1}^N M_i^T (S(M_i \beta) - y_i). $$
We are now done with computing the gradient of $L$, but it's helpful to make one further observation. Let's look more closely at the $i$th term in the sum above, which we shall call $g_i$. The vector $\beta$ has block structure, since it is the concatenation of the vectors $\beta_1, \ldots, \beta_K$. Let $S_k$ be the $k$th component function of $S$. Notice that $$ g_i = M_i^T(S(M_i \beta) - y_i) = \underbrace{\begin{bmatrix} \hat x_i & & & \\ & \hat x_i & & \\ & & \ddots & \\ & & & \hat x_i \end{bmatrix}}_{K(d+1) \times K} \underbrace{\begin{bmatrix} S_1(M_i \beta) - y_{i1} \\ S_2(M_i \beta) - y_{i2} \\ \vdots \\ S_K(M_i \beta) - y_{iK}\end{bmatrix}}_{K \times 1}. $$ We can see that the $k$th block of $g_i$ is $(S_k(M_i \beta) - y_{ik}) \hat x_i$. This is a very nice formula. Interpretation: If the predicted probability $S_k(M_i \beta)$ agrees perfectly with the ground truth probability $y_{ik}$, then the $k$th block of $g_i$ is $0$, suggesting that no change needs to be made to $\beta_k$.
The above observation makes life easier when implementing multiclass logistic regression from scratch in Python. Due to the block structure of the vectors $\beta$ and $g_i \in \mathbb R^{K(d+1)}$, it is convenient to store $\beta$ and $g_i$ as Numpy arrays with shape $(d+1) \times K$. Then $M_i \beta$ can be computed using the Numpy command `MiBeta = beta.T @ xiHat`, and $g_i$ can be computed using the Numpy command `gi = np.outer(xiHat, S(MiBeta) - yi)`. This makes it possible to implement multiclass logistic regression from scratch in Python very concisely, particularly when using stochastic gradient descent with a batch size of 1.

Next let's compute the Hessian of $L$. The Hessian matrix $HL(\beta)$ is the derivative of the function $$g(\beta) = \nabla L(\beta) = \frac{1}{N} \sum_{i=1}^N M_i^T (S(M_i \beta) - y_i). $$ To compute the derivative of $g$, it is helpful to know that $$ S'(u) = \text{diag}(S(u)) - S(u) S(u)^T, $$ which is analogous to the formula for the derivative of the sigmoid function. Again using the chain rule, we find that the derivative of $g$ is \begin{align} g'(\beta) &= \frac{1}{N} \sum_{i=1}^N M_i^T S'(M_i \beta) M_i. \end{align} If we look closely at this formula, it probably simplifies nicely (the block structure of $M_i$ suggests that the $(k,l)$ block of $M_i^T S'(M_i \beta) M_i$ is a scalar multiple of $\hat x_i \hat x_i^T$), but I still need to work out the details.
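As a concrete check on the Hessian formula, the sum $\frac{1}{N}\sum_{i=1}^N M_i^T S'(M_i \beta) M_i$ can be assembled directly, building each $M_i$ as a Kronecker product (a sketch with my own names; `beta_flat` is the flat vector of length $K(d+1)$ that stacks $\beta_1, \ldots, \beta_K$):

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / np.sum(z)

def hessian(beta_flat, X_hat, K):
    # H L(beta) = (1/N) sum_i M_i^T S'(M_i beta) M_i,
    # where S'(u) = diag(S(u)) - S(u) S(u)^T.
    N, d1 = X_hat.shape
    H = np.zeros((K * d1, K * d1))
    for i in range(N):
        Mi = np.kron(np.eye(K), X_hat[i][None, :])   # K x K(d+1) block-diagonal rows
        s = softmax(Mi @ beta_flat)
        S_prime = np.diag(s) - np.outer(s, s)        # derivative of softmax at M_i beta
        H += Mi.T @ S_prime @ Mi
    return H / N
```

Since each $S'(u) = \text{diag}(S(u)) - S(u)S(u)^T$ is symmetric positive semidefinite (it is the covariance matrix of a categorical distribution), the Hessian is symmetric positive semidefinite as well, which is consistent with $L$ being convex.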
So we have found our Hessian matrix $HL(\beta) = g'(\beta)$. Having computed the gradient and Hessian of $L$, it is now straightforward to train a multiclass logistic regression model from scratch using gradient descent or Newton's method in a language such as Python.
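Putting everything together, a bare-bones gradient-descent training loop might look like this (a sketch on synthetic data; the cluster centers, learning rate, and step count are arbitrary choices of mine, and the gradient is accumulated in the blockwise $(d+1) \times K$ layout described earlier):

```python
import numpy as np

def softmax_rows(U):
    # row-wise softmax of an N x K matrix of logits
    Z = np.exp(U - U.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# synthetic 3-class data in the plane (arbitrary cluster centers)
rng = np.random.default_rng(3)
K, d, n_per = 3, 2, 50
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per, d)) for c in centers])
Y = np.repeat(np.eye(K), n_per, axis=0)       # one-hot labels, matching row order
N = K * n_per
X_hat = np.hstack([np.ones((N, 1)), X])       # augmented features

def loss(beta):
    # average cross-entropy L(beta), with beta of shape (d+1, K)
    P = softmax_rows(X_hat @ beta)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

beta = np.zeros((d + 1, K))
lr = 0.2
history = [loss(beta)]
for _ in range(500):
    P = softmax_rows(X_hat @ beta)
    grad = X_hat.T @ (P - Y) / N              # stacks the blocks (S_k - y_ik) xHat_i
    beta -= lr * grad
    history.append(loss(beta))
```

The initial loss is exactly $\log 3$ (uniform predictions), and it decreases steadily as the classes are well separated.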